﻿WEBVTT

00:00:11.374 --> 00:00:14.567
- Hello everyone, welcome to CS231n.

00:00:14.567 --> 00:00:17.618
I'm Song Han. Today I'm
going to give a guest lecture

00:00:17.618 --> 00:00:21.468
on the efficient methods and
hardware for deep learning.

00:00:21.468 --> 00:00:24.714
So I'm a fifth year PhD
candidate here at Stanford,

00:00:24.714 --> 00:00:28.081
advised by Professor Bill Dally.

00:00:28.081 --> 00:00:31.093
So, in this course we have seen
a lot of convolution neural

00:00:31.093 --> 00:00:33.932
networks, recurrent
neural networks, or even

00:00:33.932 --> 00:00:37.358
since last time, the
reinforcement learning.

00:00:37.358 --> 00:00:39.281
They are spanning a lot of applications.

00:00:39.281 --> 00:00:41.979
For example, the self-driving
car, machine translation,

00:00:41.979 --> 00:00:44.157
AlphaGo and Smart Robots.

00:00:44.157 --> 00:00:46.904
And it's changing our
lives, but there is a recent

00:00:46.904 --> 00:00:50.781
trend that in order to
achieve such high accuracy,

00:00:50.781 --> 00:00:53.652
the models are getting larger and larger.

00:00:53.652 --> 00:00:56.669
For example for ImageNet
recognition, the winner from

00:00:56.669 --> 00:01:00.502
2012 to 2015, the model
size increased by 16X.

00:01:02.519 --> 00:01:05.104
And just in one year,
for Baidu's deep speech

00:01:05.104 --> 00:01:07.809
just in one year, the training
operations, the number

00:01:07.809 --> 00:01:11.142
of training operations increased by 10X.

00:01:12.043 --> 00:01:15.651
So such large model
creates lots of problems,

00:01:15.651 --> 00:01:18.941
for example the model size
becomes larger and larger

00:01:18.941 --> 00:01:22.413
so it's difficult for
them to be deployed either

00:01:22.413 --> 00:01:25.159
on those for example,
on the mobile phones.

00:01:25.159 --> 00:01:28.232
If the app is larger
than 100 megabytes, you

00:01:28.232 --> 00:01:30.797
cannot download until
you connect to Wi-Fi.

00:01:30.797 --> 00:01:33.315
So those product managers
at, for example, Baidu,

00:01:33.315 --> 00:01:36.280
Facebook, they are very sensitive
to the binary

00:01:36.280 --> 00:01:37.982
size of their models.

00:01:37.982 --> 00:01:40.358
And also for example, the
self-driving car, you can only

00:01:40.358 --> 00:01:43.743
do over-the-air
updates for the model;

00:01:43.743 --> 00:01:47.130
if the model is too large,
it's also difficult.

00:01:47.130 --> 00:01:51.958
And the second challenge
for those large models is

00:01:51.958 --> 00:01:55.272
that the training speed is extremely slow.

00:01:55.272 --> 00:01:58.930
For example, ResNet-152,
which is actually less

00:01:58.930 --> 00:02:03.713
than 1% more
accurate than ResNet-101,

00:02:03.713 --> 00:02:07.046
takes 1.5 weeks to train on four Maxwell

00:02:08.839 --> 00:02:10.589
M40 GPUs for example.

00:02:11.703 --> 00:02:15.175
Which is a great limitation: whether
we are doing homework

00:02:15.175 --> 00:02:17.422
or researchers are
designing new models, it is

00:02:17.422 --> 00:02:19.284
getting pretty slow.

00:02:19.284 --> 00:02:22.473
And the third challenge
for those bulky models is

00:02:22.473 --> 00:02:24.377
the energy efficiency.

00:02:24.377 --> 00:02:27.730
For example, the AlphaGo
beating Lee Sedol last year,

00:02:27.730 --> 00:02:31.563
took 2000 CPUs and 300
GPUs, which cost $3,000

00:02:33.090 --> 00:02:37.527
just to pay for the electric
bill, which is insane.

00:02:37.527 --> 00:02:39.968
So on those embedded
devices, those models

00:02:39.968 --> 00:02:43.100
are draining your battery
power, and in the data center this

00:02:43.100 --> 00:02:46.548
increases the total cost
of ownership of maintaining

00:02:46.548 --> 00:02:48.215
a large data center.

00:02:49.250 --> 00:02:51.678
For example, Google in
their blog, they mentioned

00:02:51.678 --> 00:02:55.118
if all their users used
Google Voice Search for

00:02:55.118 --> 00:02:58.592
just three minutes, they would have
to double their data centers.

00:02:58.592 --> 00:03:00.509
So that's a large cost.

00:03:01.766 --> 00:03:04.802
So reducing such cost is very important.

00:03:04.802 --> 00:03:08.356
And let's see where the
energy is actually consumed.

00:03:08.356 --> 00:03:11.024
The large model means
lots of memory access.

00:03:11.024 --> 00:03:14.060
You have to access, to load
those models from the memory,

00:03:14.060 --> 00:03:15.869
which means more energy.

00:03:15.869 --> 00:03:19.541
If you look at how much
energy is consumed by loading

00:03:19.541 --> 00:03:23.708
the memory versus how much is
consumed by multiplications

00:03:24.852 --> 00:03:29.717
and add those arithmetic
operations, the memory access

00:03:29.717 --> 00:03:33.550
is more than two or three
orders of magnitude,

00:03:34.579 --> 00:03:38.746
more energy consuming than
those arithmetic operations.

00:03:40.191 --> 00:03:43.996
So how do we make deep
learning more efficient?

00:03:43.996 --> 00:03:47.102
So we have to improve
energy efficiency by this

00:03:47.102 --> 00:03:49.852
Algorithm and Hardware Co-Design.

00:03:50.700 --> 00:03:53.090
So this is the previous
way we designed hardware.

00:03:53.090 --> 00:03:57.257
For example, we have some
benchmarks, say SPEC 2006,

00:03:58.510 --> 00:04:01.039
and then run those
benchmarks and tune your CPU

00:04:01.039 --> 00:04:03.956
architectures for those benchmarks.

00:04:06.015 --> 00:04:08.823
Now what we should do is
to open up the box to see

00:04:08.823 --> 00:04:11.620
what can we do from algorithm
side first and see what

00:04:11.620 --> 00:04:15.375
is the optimum question
mark processing unit.

00:04:15.375 --> 00:04:18.733
That breaks the
boundary between the algorithm

00:04:18.733 --> 00:04:22.316
and hardware to improve
the overall efficiency.

00:04:26.017 --> 00:04:29.779
So today's talk, I'm going
to have the following agenda.

00:04:29.779 --> 00:04:33.910
We are going to cover four
aspects: the algorithm and the hardware,

00:04:33.910 --> 00:04:36.071
and inference and training.

00:04:36.071 --> 00:04:40.817
So they form a small two by
two matrix, which includes the

00:04:40.817 --> 00:04:43.138
algorithm for efficient inference,

00:04:43.138 --> 00:04:45.291
hardware for efficient inference

00:04:45.291 --> 00:04:47.581
and the algorithm for efficient training,

00:04:47.581 --> 00:04:50.976
and lastly, the hardware
for efficient training.

00:04:50.976 --> 00:04:53.125
For example, I'm going
to cover the TPU, I'm

00:04:53.125 --> 00:04:54.609
going to cover the Volta.

00:04:54.609 --> 00:04:58.692
But before I cover those
things, let's have three

00:04:59.741 --> 00:05:02.443
slides for Hardware 101.

00:05:02.443 --> 00:05:05.180
A brief introduction of
the families of hardware

00:05:05.180 --> 00:05:06.430
in such a tree.

00:05:07.355 --> 00:05:11.955
So in general, we can
have roughly two branches.

00:05:11.955 --> 00:05:14.761
One is general purpose hardware.

00:05:14.761 --> 00:05:18.844
It can do any applications
versus the specialized

00:05:21.324 --> 00:05:25.249
hardware, which is tuned
for a specific kind of

00:05:25.249 --> 00:05:29.113
applications, a domain of applications.

00:05:29.113 --> 00:05:31.962
So the general purpose
hardware includes, the CPU

00:05:31.962 --> 00:05:35.621
or the GPU, and their
difference is that CPU is

00:05:35.621 --> 00:05:38.288
latency oriented, single threaded.

00:05:38.288 --> 00:05:40.451
It's like a big elephant.

00:05:40.451 --> 00:05:43.534
While the GPU is throughput oriented.

00:05:44.486 --> 00:05:46.846
It has many small though
weak threads, but there

00:05:46.846 --> 00:05:49.691
are thousands of such small weak cores.

00:05:49.691 --> 00:05:54.088
Like a group of small ants,
where there are so many ants.

00:05:54.088 --> 00:05:58.255
And specialized hardware,
roughly there are FPGAs and ASICs.

00:05:59.126 --> 00:06:03.274
So FPGA stands for Field
Programmable Gate Array.

00:06:03.274 --> 00:06:07.748
So it is programmable, hardware
programmable so its

00:06:07.748 --> 00:06:09.353
logic can be changed.

00:06:09.353 --> 00:06:13.520
So it's cheaper for you to try
new ideas and do prototype,

00:06:14.597 --> 00:06:16.262
but it's less efficient.

00:06:16.262 --> 00:06:18.185
It's in the middle between
the general purpose and

00:06:18.185 --> 00:06:19.018
pure ASIC.

00:06:19.965 --> 00:06:24.137
So ASIC stands for Application
Specific Integrated Circuit.

00:06:24.137 --> 00:06:25.842
It has a fixed logic, just designed

00:06:25.842 --> 00:06:27.293
for a certain application.

00:06:27.293 --> 00:06:29.341
For example deep learning.

00:06:29.341 --> 00:06:34.264
And Google's TPU is a kind of
ASIC and the neural networks

00:06:34.264 --> 00:06:37.852
we train on, the earlier GPUs is here.

00:06:37.852 --> 00:06:41.645
And another slide for
Hardware 101 is the number

00:06:41.645 --> 00:06:43.657
representations.

00:06:43.657 --> 00:06:47.473
So in this slide, I'm going
to convey the idea that

00:06:47.473 --> 00:06:49.924
all the numbers in a computer
are not represented

00:06:49.924 --> 00:06:51.742
as real numbers.

00:06:51.742 --> 00:06:54.536
It's not a real number, but
they are actually discrete.

00:06:54.536 --> 00:06:57.977
Even for those floating
point with your 32 Bit.

00:06:57.977 --> 00:07:02.301
Floating point numbers, their
resolution is not perfect.

00:07:02.301 --> 00:07:06.336
It's not continuous, but it's discrete.

00:07:06.336 --> 00:07:10.271
So for example FP32, meaning
using a 32 bit to represent

00:07:10.271 --> 00:07:12.147
a floating point number.

00:07:12.147 --> 00:07:15.296
So there are three components
in the representation.

00:07:15.296 --> 00:07:18.907
The sign bit, the
exponent bit, the mantissa,

00:07:18.907 --> 00:07:23.682
and the number it represents
is shown by minus 1 to the S

00:07:23.682 --> 00:07:26.515
times 1.M times 2 to the exponent.

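NOTE
[Editor's note] A minimal Python sketch (added for illustration, not part of the lecture) of the FP32 layout just described: 1 sign bit, 8 exponent bits, 23 mantissa bits, and value = (-1)^S * 1.M * 2^(E - 127) for normal numbers.
import struct
def decode_fp32(x):
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31                     # S: 1 sign bit
    exponent = (bits >> 23) & 0xFF        # E: 8 exponent bits, biased by 127
    mantissa = bits & 0x7FFFFF            # M: 23 mantissa bits
    # Valid for normal numbers only (exponent not 0 or 255).
    return (-1) ** sign * (1 + mantissa / 2 ** 23) * 2 ** (exponent - 127)
print(decode_fp32(2.75))  # 2.75
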
00:07:28.778 --> 00:07:32.745
So similar there is FP16,
using a 16 bit to represent

00:07:32.745 --> 00:07:34.745
a floating point number.

00:07:36.616 --> 00:07:39.375
In particular, I'm going
to introduce Int8, where

00:07:39.375 --> 00:07:43.692
the Google TPU uses: using an
integer to represent a fixed

00:07:43.692 --> 00:07:44.863
point number.

00:07:44.863 --> 00:07:47.912
So we have a certain number
of bits for the integer.

00:07:47.912 --> 00:07:50.827
Followed by a radix point,
which can differ between layers.

00:07:50.827 --> 00:07:54.255
And lastly, the fractional bits.

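NOTE
[Editor's note] A small illustrative sketch (plain Python, names are not from the lecture) of the fixed-point idea: some bits for the integer part, a radix point, and the remaining bits for the fraction. With 8 bits total, 1 sign bit and 4 fractional bits, the step size is 2**-4.
def to_fixed(x, frac_bits=4, total_bits=8):
    # Quantize x to a signed fixed-point integer with frac_bits after the radix point.
    q = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, q))
def from_fixed(q, frac_bits=4):
    return q / (1 << frac_bits)
print(from_fixed(to_fixed(1.3)))  # 1.3125, the nearest representable value
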
00:07:54.255 --> 00:07:58.088
So why do we prefer those
eight bit, or 16 bit

00:07:59.257 --> 00:08:01.502
rather than those traditional like the

00:08:01.502 --> 00:08:03.844
32 bit floating point.

00:08:03.844 --> 00:08:04.856
That's because of the cost.

00:08:04.856 --> 00:08:08.981
So, I generated the figure
from 45 nanometer technology

00:08:08.981 --> 00:08:13.189
about the energy cost versus
the area cost for different

00:08:13.189 --> 00:08:14.635
operations.

00:08:14.635 --> 00:08:18.709
In particular, let's see
here: going from 32 bit to

00:08:18.709 --> 00:08:22.876
16 bit, we have about four
times reduction in energy

00:08:24.066 --> 00:08:28.783
and also about four times
reduction in the area.

00:08:28.783 --> 00:08:30.966
Area means money.

00:08:30.966 --> 00:08:33.751
Every square millimeter takes
money to tape out a chip.

00:08:33.751 --> 00:08:38.592
So it's very beneficial for
hardware design to go from

00:08:38.592 --> 00:08:40.009
32 bit to 16 bit.

00:08:41.801 --> 00:08:45.968
That's why you hear that NVIDIA,
from the Pascal architecture,

00:08:46.894 --> 00:08:49.821
they said they're
starting to support FP16.

00:08:49.821 --> 00:08:53.915
That's the reason why it's so beneficial.

00:08:53.915 --> 00:08:57.122
For example, previous battery
level could last four hours,

00:08:57.122 --> 00:08:58.662
now it becomes 16 hours.

00:08:58.662 --> 00:09:00.269
That's what it means to reduce

00:09:00.269 --> 00:09:02.698
the energy cost by four times.

00:09:02.698 --> 00:09:07.160
But here still, there's a
problem of large energy costs

00:09:07.160 --> 00:09:08.297
for reading the memory.

00:09:08.297 --> 00:09:11.771
And let's see how we can deal
with this: memory references are

00:09:11.771 --> 00:09:16.279
so expensive, how do we deal
with this problem better?

00:09:16.279 --> 00:09:19.913
So let's switch gear and
come to our topic directly.

00:09:19.913 --> 00:09:24.285
So let's first introduce
algorithm for efficient inference.

00:09:24.285 --> 00:09:27.919
So I'm going to cover six topics,
this is a really long slide.

00:09:27.919 --> 00:09:30.336
So I'm going to go relatively fast.

00:09:31.796 --> 00:09:34.747
So the first idea I'm going
to talk about is pruning.

00:09:34.747 --> 00:09:36.767
Pruning the neural networks.

00:09:36.767 --> 00:09:39.671
For example, this is
original neural network.

00:09:39.671 --> 00:09:42.927
So what I'm trying to do is,
can we remove some of the

00:09:42.927 --> 00:09:46.260
weight and still have the same accuracy?

00:09:47.424 --> 00:09:49.026
It's like pruning a tree, get rid

00:09:49.026 --> 00:09:51.838
of those redundant connections.

00:09:51.838 --> 00:09:55.540
This is first proposed by
Professor Yann LeCun back in 1989,

00:09:55.540 --> 00:09:59.839
and I revisited this problem,
26 years later, on those

00:09:59.839 --> 00:10:03.933
modern deep neural nets
to see how it works.

00:10:03.933 --> 00:10:06.764
So not all parameters are useful actually.

00:10:06.764 --> 00:10:09.388
For example, in this case, if
you want to fit a single line,

00:10:09.388 --> 00:10:12.308
but you're using a quadratic
term, apparently the

00:10:12.308 --> 00:10:14.808
0.01 is a redundant parameter.

00:10:15.977 --> 00:10:18.174
So I'm going to train the
connectivity first and then

00:10:18.174 --> 00:10:20.611
prune some of the connections.

00:10:20.611 --> 00:10:22.384
And then train the remaining weights,

00:10:22.384 --> 00:10:24.364
and we iterate through this process.

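NOTE
[Editor's note] A minimal NumPy sketch (added for illustration; not the lecture's code) of the train / prune / retrain loop just described: keep a binary mask, zero out the smallest-magnitude weights, and keep them at zero in later iterations.
import numpy as np
def prune_by_magnitude(weights, sparsity):
    # Zero out the smallest |w| so that `sparsity` fraction of weights is removed.
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask
w = np.random.randn(1000)
mask = np.ones_like(w)
for sparsity in [0.5, 0.7, 0.9]:          # iterative pruning schedule
    w, mask = prune_by_magnitude(w, sparsity)
    # ... retrain here, multiplying every gradient update by `mask`
    # so pruned connections stay at zero ...
print((w == 0).mean())                    # roughly 0.9 of the weights are now zero
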
00:10:24.364 --> 00:10:28.663
And as a result, I can reduce
the number of connections,

00:10:28.663 --> 00:10:31.908
in AlexNet from 60
million parameters to only

00:10:31.908 --> 00:10:35.278
six million parameters,
which is 10 times less

00:10:35.278 --> 00:10:36.611
the computation.

00:10:37.645 --> 00:10:39.645
So this is the accuracy.

00:10:42.842 --> 00:10:46.224
So the x-axis is how much
parameters to prune away

00:10:46.224 --> 00:10:49.592
and the y-axis is the accuracy you have.

00:10:49.592 --> 00:10:53.180
So we want to have less
parameters, but we also

00:10:53.180 --> 00:10:55.834
want to have the same accuracy as before.

00:10:55.834 --> 00:10:58.424
We don't want to sacrifice accuracy,

00:10:58.424 --> 00:11:02.591
For example, at 80%, if we
just zero away 80%

00:11:04.255 --> 00:11:08.257
of the parameters, the
accuracy drops by about 4%.

00:11:08.257 --> 00:11:10.097
That's intolerable.

00:11:10.097 --> 00:11:12.535
But the good thing is that
if we retrain the remaining

00:11:12.535 --> 00:11:16.285
weights, the accuracy
can fully recover here.

00:11:18.020 --> 00:11:19.914
And if we do this process iteratively

00:11:19.914 --> 00:11:22.997
by pruning and retraining,
pruning and retraining,

00:11:22.997 --> 00:11:26.938
we can fully recover the
accuracy, up until we

00:11:26.938 --> 00:11:30.479
prune away 90% of the parameters.

00:11:30.479 --> 00:11:34.114
So if you go back home
and try it in your iPython

00:11:34.114 --> 00:11:38.314
notebook, just zero away
50% of the parameters, say,

00:11:38.314 --> 00:11:41.118
in your homework, and you will be astonished to find

00:11:41.118 --> 00:11:44.118
that accuracy actually doesn't hurt.

00:11:45.087 --> 00:11:47.422
So we just mentioned
convolution neural nets,

00:11:47.422 --> 00:11:52.301
how about RNNs and LSTMs? So I
tried it with NeuralTalk.

00:11:52.301 --> 00:11:55.637
Again, pruning away 90% of
the weights doesn't hurt the

00:11:55.637 --> 00:11:56.554
BLEU score.

00:11:58.385 --> 00:12:00.007
And here are some visualizations.

00:12:00.007 --> 00:12:04.401
For example, the original
picture, the neural talk says

00:12:04.401 --> 00:12:07.507
a basketball player in a
white uniform is playing

00:12:07.507 --> 00:12:08.710
with a ball.

00:12:08.710 --> 00:12:12.797
Versus pruning away 90% it
says, a basketball player

00:12:12.797 --> 00:12:16.775
in a white uniform is
playing with a basketball.

00:12:16.775 --> 00:12:18.192
And on and so on.

00:12:19.155 --> 00:12:23.157
But if you're too aggressive,
say you prune away

00:12:23.157 --> 00:12:27.324
95% of the weights, the
network is going to get drunk.

00:12:28.766 --> 00:12:32.355
It says, a man in a red shirt
and white and black shirt

00:12:32.355 --> 00:12:34.345
is running through a field.

00:12:34.345 --> 00:12:37.059
So there's really a limit,
a threshold, you have to

00:12:37.059 --> 00:12:39.726
take care of during the pruning.

00:12:41.095 --> 00:12:43.395
So interestingly, after
I did the work, I did some

00:12:43.395 --> 00:12:45.788
research and
found that the same

00:12:45.788 --> 00:12:49.524
pruning procedure actually
happens in the human brain

00:12:49.524 --> 00:12:50.357
as well.

00:12:50.357 --> 00:12:54.459
So when we were born, there
are about 50 trillion synapses

00:12:54.459 --> 00:12:55.688
in the brain.

00:12:55.688 --> 00:13:00.162
And at one year old, this number
surged into 1,000 trillion.

00:13:00.162 --> 00:13:04.329
And as we become adolescents,
it actually becomes smaller,

00:13:05.201 --> 00:13:09.368
500 trillion in the end,
according to a study in Nature.

00:13:11.803 --> 00:13:13.459
So this is very interesting.

00:13:13.459 --> 00:13:15.966
And also, the pruning changed
the weight distribution

00:13:15.966 --> 00:13:18.957
because we are removing
those small connections

00:13:18.957 --> 00:13:22.027
and after we retrain them,
that's why it becomes soft

00:13:22.027 --> 00:13:22.944
in the end.

00:13:23.939 --> 00:13:25.570
Yeah, question.

00:13:25.570 --> 00:13:26.781
- [Student] Do you mean that

00:13:26.781 --> 00:13:29.901
the pruned weights
during the retraining will be

00:13:29.901 --> 00:13:32.259
just set at zero and
you start from scratch?

00:13:32.259 --> 00:13:35.386
Or do you restart from the
things that are at zero?

00:13:35.386 --> 00:13:37.411
- Yeah. So the question is,
how do we deal with those

00:13:37.411 --> 00:13:39.435
zero connections?

00:13:39.435 --> 00:13:43.602
So we force them to be zero
in all the other iterations.

00:13:45.369 --> 00:13:46.427
Question?

00:13:46.427 --> 00:13:50.153
- [Student] How do you
pick which weights to drop?

00:13:50.153 --> 00:13:53.293
- Yeah, so it's very simple: sort the
weights, and if a weight is small, drop it.

00:13:53.293 --> 00:13:54.421
If it's small, just--

00:13:54.421 --> 00:13:55.709
- [Student] Any threshold that I decide?

00:13:55.709 --> 00:13:57.042
- Exactly, yeah.

00:13:59.058 --> 00:14:01.929
So the next idea, weight sharing.

00:14:01.929 --> 00:14:05.574
So now we have, remember
our end goal is to remove

00:14:05.574 --> 00:14:09.703
connections so that we can
have a smaller memory footprint,

00:14:09.703 --> 00:14:12.567
so that we can have more
energy efficient deployment.

00:14:12.567 --> 00:14:15.361
Now we have a smaller number
of parameters thanks to pruning.

00:14:15.361 --> 00:14:19.446
We want to have less number
of bits per parameter

00:14:19.446 --> 00:14:23.204
so they're multiplied together
they get a small model.

00:14:23.204 --> 00:14:25.287
So the idea is like this.

00:14:26.267 --> 00:14:28.445
Not all numbers, not all the weights

00:14:28.445 --> 00:14:30.977
have to be the exact number.

00:14:30.977 --> 00:14:35.144
For example, 2.09, 2.12 or
all these four weights, you

00:14:36.725 --> 00:14:39.867
just put them using 2.0 to represent them.

00:14:39.867 --> 00:14:41.278
That's enough.

00:14:41.278 --> 00:14:45.445
Otherwise, a too-accurate number
just leads to overfitting.

00:14:46.851 --> 00:14:50.227
So the idea is I can
cluster the weights if they

00:14:50.227 --> 00:14:53.278
are similar, just using
a centroid to represent

00:14:53.278 --> 00:14:57.558
the number instead of using
the full precision weight.

00:14:57.558 --> 00:15:01.094
So that every time I do the
inference, I just do inference

00:15:01.094 --> 00:15:03.417
on this single number.

00:15:03.417 --> 00:15:06.995
For example, this is a
four by four weight matrix

00:15:06.995 --> 00:15:09.027
in a certain layer.

00:15:09.027 --> 00:15:12.715
And what I'm going to do is do
k-means clustering by having

00:15:12.715 --> 00:15:15.496
the similar weight
sharing the same centroid.

00:15:15.496 --> 00:15:19.364
For example, 2.09, 2.12, I store index of

00:15:19.364 --> 00:15:21.987
three pointing to here.

00:15:21.987 --> 00:15:25.529
So that, the good thing is
we need to only store the

00:15:25.529 --> 00:15:29.638
two bit index rather than the
32 bit, floating point number.

00:15:29.638 --> 00:15:31.555
That's 16 times saving.

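NOTE
[Editor's note] An illustrative sketch (plain NumPy k-means, not the lecture's implementation) of weight sharing: cluster the weights of a layer into 2^b centroids and store only a b-bit index per weight plus the small codebook.
import numpy as np
def weight_share(weights, bits=2):
    k = 2 ** bits                                    # e.g. 4 centroids for 2-bit indices
    centroids = np.linspace(weights.min(), weights.max(), k)
    for _ in range(20):                              # simple k-means (Lloyd's algorithm)
        idx = np.argmin(np.abs(weights[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(idx == j):
                centroids[j] = weights[idx == j].mean()
    return idx.astype(np.uint8), centroids           # store indices + codebook
w = np.random.randn(16)                              # a 4x4 layer, flattened
idx, codebook = weight_share(w, bits=2)
approx = codebook[idx]                               # weights used at inference time
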
00:15:34.577 --> 00:15:37.257
And how do we train such neural network?

00:15:37.257 --> 00:15:41.424
They are bound together, so
after we get the gradients,

00:15:42.372 --> 00:15:45.540
we color them in the same
pattern as the weight

00:15:45.540 --> 00:15:48.354
and then we do a group-by
operation, by having all

00:15:48.354 --> 00:15:52.604
the gradients with the
same index grouped together.

00:15:52.604 --> 00:15:56.034
And then we do a reduction
by summing them up.

00:15:56.034 --> 00:15:58.106
And then we multiply by the learning rate

00:15:58.106 --> 00:16:00.404
and subtract that from the original centroid.

00:16:00.404 --> 00:16:04.321
That's one iteration of
the SGD for such weight

00:16:05.292 --> 00:16:07.125
shared neural network.

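NOTE
[Editor's note] A short, self-contained sketch (assumed NumPy, toy numbers) of one SGD step for the shared weights just described: group the gradients by centroid index, sum each group, and update the centroid.
import numpy as np
centroids = np.array([-1.0, 0.0, 0.5, 2.0])          # codebook from the clustering step
idx = np.array([3, 0, 1, 3, 2, 2])                   # which centroid each weight uses
grad = np.array([0.2, -0.1, 0.05, 0.3, -0.2, 0.1])   # dL/dw for each weight
lr = 0.01
for j in range(len(centroids)):                      # group by index, reduce by sum
    if np.any(idx == j):
        centroids[j] -= lr * grad[idx == j].sum()
print(centroids)
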
00:16:08.613 --> 00:16:10.826
So remember previously,
after pruning this is

00:16:10.826 --> 00:16:14.409
what the weight
distribution looks like, and after

00:16:16.164 --> 00:16:18.575
weight sharing, they become discrete.

00:16:18.575 --> 00:16:21.215
There are only 16 different
values here, meaning

00:16:21.215 --> 00:16:25.048
we can use four bits to
represent each number.

00:16:26.476 --> 00:16:29.764
And by training on such
weight shared neural network,

00:16:29.764 --> 00:16:31.986
training on this extremely
weight-shared neural network,

00:16:31.986 --> 00:16:34.756
these weights can adjust.

00:16:34.756 --> 00:16:39.146
It is these subtle changes
that compensate for the

00:16:39.146 --> 00:16:40.563
loss of accuracy.

00:16:41.407 --> 00:16:44.914
So let's see, this is the
number of bits we give it,

00:16:44.914 --> 00:16:48.581
this is the accuracy
for convolution layers.

00:16:50.095 --> 00:16:54.884
Not until four bits, does
the accuracy begin to drop

00:16:54.884 --> 00:16:59.073
and for those fully connected
layers, very astonishingly,

00:16:59.073 --> 00:17:02.014
it's not until two bits,
only four numbers, that the

00:17:02.014 --> 00:17:03.702
accuracy begins to drop.

00:17:03.702 --> 00:17:06.119
And this result is per layer.

00:17:08.470 --> 00:17:12.404
So we have covered two methods,
pruning and weight sharing.

00:17:12.404 --> 00:17:15.433
What if we combine these
two methods together.

00:17:15.433 --> 00:17:16.982
Do they work well?

00:17:16.982 --> 00:17:20.444
So by combining those methods,
this is the compression

00:17:20.444 --> 00:17:22.814
ratio with the smaller on the left.

00:17:22.814 --> 00:17:24.684
And this is the accuracy.

00:17:24.684 --> 00:17:27.382
We can combine it together
and make the model

00:17:27.382 --> 00:17:32.364
about 3% of its original
size without hurting the

00:17:32.364 --> 00:17:33.804
accuracy at all.

00:17:33.804 --> 00:17:36.481
Compared with each method
working individually, where at about

00:17:36.481 --> 00:17:39.492
10% the accuracy begins to drop.

00:17:39.492 --> 00:17:41.742
And compared with the
cheap SVD method,

00:17:41.742 --> 00:17:44.742
this has a better compression ratio.

00:17:46.742 --> 00:17:50.650
And final idea is we can
apply the Huffman Coding

00:17:50.650 --> 00:17:55.031
to use more bits
for those infrequent numbers,

00:17:55.031 --> 00:17:59.061
infrequently appearing weights,
and fewer bits

00:17:59.061 --> 00:18:03.351
for those more frequently
appearing weights.

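NOTE
[Editor's note] A compact sketch (standard textbook Huffman coding with heapq; added for illustration only) of the idea: frequent weight values get short codes, rare ones get long codes.
import heapq
from collections import Counter
def huffman_code(symbols):
    counts = Counter(symbols)
    heap = [[freq, i, {sym: ''}] for i, (sym, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        lo[2] = {s: '0' + c for s, c in lo[2].items()}
        hi[2] = {s: '1' + c for s, c in hi[2].items()}
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], {**lo[2], **hi[2]}])
    return heap[0][2]
# Quantized weight indices: index 0 appears most often, so it gets the shortest code.
print(huffman_code([0, 0, 0, 0, 0, 1, 1, 2, 3]))
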
00:18:03.351 --> 00:18:06.469
So by combining these three
methods, pruning, weight

00:18:06.469 --> 00:18:09.709
sharing, and also Huffman
Coding, we can compress the

00:18:09.709 --> 00:18:13.490
neural networks, state-of-the-art
neural networks,

00:18:13.490 --> 00:18:17.073
ranging from 10x to
49x without hurting the

00:18:20.159 --> 00:18:21.370
prediction accuracy.

00:18:21.370 --> 00:18:23.267
Sometimes a little bit better.

00:18:23.267 --> 00:18:25.948
But maybe that is noise.

00:18:25.948 --> 00:18:30.115
So the next question is, these
models are just pre-trained

00:18:31.069 --> 00:18:33.509
models by say Google, Microsoft.

00:18:33.509 --> 00:18:37.479
Can we make a compact
model, a compact model,

00:18:37.479 --> 00:18:38.457
to begin with?

00:18:38.457 --> 00:18:40.874
Even before such compression?

00:18:42.297 --> 00:18:47.098
So SqueezeNet, you may have
already worked with this

00:18:47.098 --> 00:18:50.015
neural network model in a homework.

00:18:50.978 --> 00:18:55.145
So the idea is we are having
a squeeze layer here, so that

00:18:58.639 --> 00:19:01.198
the three by three
convolutions see a smaller number of

00:19:01.198 --> 00:19:02.031
channels.

00:19:03.669 --> 00:19:06.177
So that's where squeeze comes from.

00:19:06.177 --> 00:19:10.119
And here we have two branches,
rather than four branches

00:19:10.119 --> 00:19:12.286
as in the inception model.

00:19:13.919 --> 00:19:16.668
So as a result, the model
is extremely compact.

00:19:16.668 --> 00:19:19.370
It doesn't have any
fully connected layers.

00:19:19.370 --> 00:19:20.978
Everything is fully convolutional.

00:19:20.978 --> 00:19:23.895
The last layer is a global pooling.

00:19:27.338 --> 00:19:31.698
So what if we apply deep
compression algorithm

00:19:31.698 --> 00:19:35.738
on such an already compact
model? Will it get even

00:19:35.738 --> 00:19:36.571
smaller?

00:19:38.069 --> 00:19:42.389
So this is AlexNet after
compression, this is SqueezeNet.

00:19:42.389 --> 00:19:46.556
Even before compression, it's
50x smaller than AlexNet,

00:19:47.498 --> 00:19:49.638
but has the same accuracy.

00:19:49.638 --> 00:19:53.805
After compression 510x
smaller, but the same accuracy

00:19:56.093 --> 00:19:58.676
only less than half a megabyte.

00:20:00.444 --> 00:20:03.544
This means it's very easy
to fit such a small model

00:20:03.544 --> 00:20:07.705
in the cache, which is literally

00:20:07.705 --> 00:20:09.538
tens of megabytes of SRAM.

00:20:11.407 --> 00:20:12.865
So what does it mean?

00:20:12.865 --> 00:20:15.412
It's possible to achieve speed up.

00:20:15.412 --> 00:20:18.964
So this is the speedup I
measured, on all these fully

00:20:18.964 --> 00:20:23.131
connected layers only for
now, on the CPU, GPU, and

00:20:24.447 --> 00:20:26.601
the mobile GPU, before pruning

00:20:26.601 --> 00:20:28.839
and after pruning the weights,

00:20:28.839 --> 00:20:33.081
and on average, I observed
a 3x speedup in a CPU,

00:20:33.081 --> 00:20:35.409
about 3X speedup on the GPU,

00:20:35.409 --> 00:20:39.151
and roughly 5x speedup on
the mobile GPU, which is a

00:20:39.151 --> 00:20:39.984
TK1.

00:20:41.511 --> 00:20:44.679
And so is the energy efficiency.

00:20:44.679 --> 00:20:49.528
An average improvement
from 3x to 6x on a CPU, GPU,

00:20:49.528 --> 00:20:50.778
and mobile GPU.

00:20:52.209 --> 00:20:55.876
And these ideas are
used in these companies.

00:20:57.998 --> 00:21:00.391
Having talked about weight
pruning and weight sharing,

00:21:00.391 --> 00:21:02.791
which is a non-linear quantization method

00:21:02.791 --> 00:21:05.598
we're now going to talk about
quantization, which is what

00:21:05.598 --> 00:21:08.479
they use in the TPU design.

00:21:08.479 --> 00:21:12.671
All the TPU designs use
only eight bits for inference.

00:21:12.671 --> 00:21:15.729
And the way, how they can
use that is because of the

00:21:15.729 --> 00:21:16.749
quantization.

00:21:16.749 --> 00:21:19.332
And let's see how does it work.

00:21:20.248 --> 00:21:24.968
So quantization has this
complicated figure, but

00:21:24.968 --> 00:21:26.769
the intuition is very simple.

00:21:26.769 --> 00:21:30.351
You run the neural network
and train it with the normal

00:21:30.351 --> 00:21:32.268
floating point numbers.

00:21:33.849 --> 00:21:37.677
And quantize the weights
and activations by gathering

00:21:37.677 --> 00:21:39.700
the statistics for each layer.

00:21:39.700 --> 00:21:42.860
For example, what is the maximum number,
minimum number,

00:21:42.860 --> 00:21:44.863
and how many bits are enough

00:21:44.863 --> 00:21:47.511
to represent this dynamic range.

00:21:47.511 --> 00:21:51.892
Then you use that number of
bits for the integer part

00:21:51.892 --> 00:21:54.201
and the rest of the eight bit or seven bit

00:21:54.201 --> 00:21:58.118
for the fractional part of
the 8 bit representation.

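NOTE
[Editor's note] A small sketch (illustrative NumPy, not the TPU's actual code path) of the per-layer quantization just described: measure the dynamic range, give enough bits to the integer part, and use the remaining bits of the 8-bit word for the fraction.
import numpy as np
def quantize_layer(x, total_bits=8):
    max_abs = np.abs(x).max()                        # per-layer statistic: the dynamic range
    int_bits = max(0, int(np.ceil(np.log2(max_abs + 1e-12))))  # bits needed for the integer part
    frac_bits = total_bits - 1 - int_bits            # 1 bit for the sign, the rest for the fraction
    scale = 2.0 ** frac_bits
    q = np.clip(np.round(x * scale), -2 ** (total_bits - 1), 2 ** (total_bits - 1) - 1)
    return q.astype(np.int8), frac_bits
acts = np.random.randn(1000) * 3.0
q, frac_bits = quantize_layer(acts)
approx = q / 2.0 ** frac_bits                        # dequantized values for checking
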
00:22:00.241 --> 00:22:05.041
And also we can fine tune in
the floating point format.

00:22:05.041 --> 00:22:08.281
Or we can also use feed
forward with fixed point

00:22:08.281 --> 00:22:11.509
and back propagation with
update with the floating

00:22:11.509 --> 00:22:12.489
point number.

00:22:12.489 --> 00:22:17.391
There are lots of different
ideas to have better accuracy.

00:22:17.391 --> 00:22:21.409
And this is the result,
for how many number of bits

00:22:21.409 --> 00:22:23.121
versus what is the accuracy.

00:22:23.121 --> 00:22:26.020
For example, using a fixed,
8 bit, the accuracy for

00:22:26.020 --> 00:22:28.871
GoogleNet doesn't drop significantly.

00:22:28.871 --> 00:22:33.057
And for VGG-16, it also
remains pretty well for

00:22:33.057 --> 00:22:34.100
the accuracy.

00:22:34.100 --> 00:22:36.763
While going down to
six bits, the accuracy

00:22:36.763 --> 00:22:39.680
begins to drop pretty dramatically.

00:22:41.641 --> 00:22:44.474
Next idea, low rank approximation.

00:22:47.500 --> 00:22:51.083
It turned out that for
a convolution layer,

00:22:51.951 --> 00:22:55.949
you can break it into
two convolution layers.

00:22:55.949 --> 00:22:59.521
One convolution here, followed
by a one by one convolution.

00:22:59.521 --> 00:23:02.441
So that it's like you
break a complicated problem

00:23:02.441 --> 00:23:05.380
into two separate small problems.

00:23:05.380 --> 00:23:07.401
This is for convolution layer.

00:23:07.401 --> 00:23:10.292
As we can see, achieving about

00:23:10.292 --> 00:23:14.641
2x speedup, there's almost
no loss of accuracy.

00:23:14.641 --> 00:23:18.529
And achieving a speedup
of 5x, roughly a 6%

00:23:18.529 --> 00:23:19.946
loss of accuracy.

00:23:21.260 --> 00:23:24.020
And this also works for
fully connected layers.

00:23:24.020 --> 00:23:28.110
The simplest idea is using
the SVD to break it into

00:23:28.110 --> 00:23:30.721
one matrix into two matrices.

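NOTE
[Editor's note] An illustrative NumPy sketch of the SVD idea for a fully connected layer: replace one m-by-n weight matrix with two smaller matrices of rank r, so the layer becomes two cheaper matrix multiplies.
import numpy as np
m, n, r = 1024, 1024, 64
W = np.random.randn(m, n)        # (real weight matrices are far more compressible than this random example)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]             # m x r
B = Vt[:r, :]                    # r x n
x = np.random.randn(n)
y_full = W @ x                   # original layer: m*n multiplies
y_lowrank = A @ (B @ x)          # two layers: r*(m+n) multiplies
print(np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
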
00:23:30.721 --> 00:23:34.888
And follow this idea, this
paper proposes to use the

00:23:36.121 --> 00:23:40.940
Tensor Tree to break down one
fully connected layer into

00:23:40.940 --> 00:23:43.631
a tree, lots of fully connected layers.

00:23:43.631 --> 00:23:46.131
That's why it's called a tree.

00:23:49.001 --> 00:23:52.191
So going even more crazy, can we use only

00:23:52.191 --> 00:23:56.671
two weights or three weights
to represent a neural network?

00:23:56.671 --> 00:23:59.601
A ternary weight or a binary weight.

00:23:59.601 --> 00:24:02.531
We have already seen this distribution
before, after pruning.

00:24:02.531 --> 00:24:04.911
There's some positive
weights and negative weights.

00:24:04.911 --> 00:24:08.791
Can we just use three numbers,
just use one, minus one, zero

00:24:08.791 --> 00:24:12.081
to represent the neural network.

00:24:12.081 --> 00:24:16.452
This is our recent paper,
where we maintain

00:24:16.452 --> 00:24:20.852
a full precision weight
during training time,

00:24:20.852 --> 00:24:24.292
but at inference time, we
only keep the scaling factor

00:24:24.292 --> 00:24:26.063
and the ternary weight.

00:24:26.063 --> 00:24:30.831
So during inference, we
only need three weights.

00:24:30.831 --> 00:24:35.831
That's very efficient and
making the model very small.

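NOTE
[Editor's note] A rough sketch (simplified; the full method is the trained ternary quantization described in the speaker's paper) of keeping full-precision weights during training but quantizing to {-Wn, 0, +Wp} for inference. The threshold value here is illustrative.
import numpy as np
def ternarize(w, t=0.05):
    # Threshold small weights to zero; scale positives and negatives separately.
    mask_p, mask_n = w > t, w < -t
    wp = w[mask_p].mean() if mask_p.any() else 0.0   # positive scaling factor
    wn = -w[mask_n].mean() if mask_n.any() else 0.0  # negative scaling factor
    return wp * mask_p.astype(w.dtype) - wn * mask_n.astype(w.dtype)
w_full = np.random.randn(3, 3) * 0.1                 # full precision, kept for training
w_ternary = ternarize(w_full)                        # only this is kept for inference
print(np.unique(w_ternary))                          # at most three distinct values
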
00:24:35.831 --> 00:24:38.332
This is the proportion
of the positive zero

00:24:38.332 --> 00:24:41.700
and negative weights, they can
change during the training.

00:24:41.700 --> 00:24:44.200
So is their absolute value.

00:24:46.092 --> 00:24:50.236
And this is the visualization of kernels

00:24:50.236 --> 00:24:53.809
by this trained ternary quantization.

00:24:53.809 --> 00:24:57.976
We can see some of them are
a corner detector like here.

00:24:59.336 --> 00:25:00.986
And also here.

00:25:00.986 --> 00:25:03.856
Some of them are maybe edge detector.

00:25:03.856 --> 00:25:06.107
For example, this filter some of them

00:25:06.107 --> 00:25:09.249
are corner detector like here this filter.

00:25:09.249 --> 00:25:12.537
Actually we don't need
such fine grain resolution.

00:25:12.537 --> 00:25:15.168
Just three weights are enough.

00:25:15.168 --> 00:25:19.335
So this is the validation
accuracy on ImageNet with AlexNet.

00:25:21.318 --> 00:25:24.238
So the threshold line is the baseline accuracy

00:25:24.238 --> 00:25:26.529
with floating point 32.

00:25:26.529 --> 00:25:29.112
And the red line is our result.

00:25:29.979 --> 00:25:34.979
Pretty much the same accuracy
converged compared with

00:25:34.979 --> 00:25:37.229
the full precision weights.

00:25:40.390 --> 00:25:43.307
Last idea, Winograd Transformation.

00:25:44.470 --> 00:25:47.491
So this is about how we
implement deep neural nets,

00:25:47.491 --> 00:25:50.001
how do we implement the convolutions.

00:25:50.001 --> 00:25:52.430
So this is the conventional direct

00:25:52.430 --> 00:25:55.190
convolution implementation method.

00:25:55.190 --> 00:25:58.459
The slide credited to
Julien, a friend from Nvidia.

00:25:58.459 --> 00:26:01.959
So originally, we just do the element wise

00:26:03.298 --> 00:26:06.390
do a dot product for those
nine elements in the filter

00:26:06.390 --> 00:26:10.310
and nine elements in the
image and then sum it up.

00:26:10.310 --> 00:26:15.179
For example, for every
output we need nine times C

00:26:15.179 --> 00:26:18.012
number of multiplication and adds.

00:26:19.314 --> 00:26:23.481
Winograd Convolution is another
method, equivalent method.

00:26:27.444 --> 00:26:31.491
It's not lossy, it's an
equivalent method proposed at

00:26:31.491 --> 00:26:33.531
first through this paper, Fast Algorithms

00:26:33.531 --> 00:26:35.334
for Convolution Neural Networks.

00:26:35.334 --> 00:26:38.212
That instead of directly
doing the convolution, move

00:26:38.212 --> 00:26:42.379
it one by one, at first it
transforms the input feature

00:26:43.905 --> 00:26:46.155
map to another feature map.

00:26:47.066 --> 00:26:51.233
The transform contains only
weights like 1, 0.5, and 2,

00:26:53.396 --> 00:26:56.813
which can be efficiently
implemented with shifts.

00:26:56.813 --> 00:27:00.980
And also transform the filter
into a four by four tensor.

00:27:02.076 --> 00:27:06.324
So what we are going to do here
is sum over c and do an element-wise

00:27:06.324 --> 00:27:07.824
element-wise product.

00:27:08.964 --> 00:27:13.564
So there are only 16
multiplications happening here.

00:27:13.564 --> 00:27:18.356
And then we do a inverse
transform to get four outputs.

00:27:18.356 --> 00:27:21.175
So the transform and the
inverse transform can be

00:27:21.175 --> 00:27:24.932
amortized, so their multiplications
can be ignored.

00:27:24.932 --> 00:27:29.099
So in order to get four output,
we need nine times channel

00:27:30.524 --> 00:27:34.444
times four, which is 36 times channel.

00:27:34.444 --> 00:27:39.093
Multiplications originally
for the direct convolution

00:27:39.093 --> 00:27:42.676
but now we need only 16
times C for the four outputs.

00:27:46.655 --> 00:27:50.822
So that is 2.25x fewer
multiplications to

00:27:53.916 --> 00:27:57.083
perform the exact same convolution.

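NOTE
[Editor's note] A NumPy sketch of the standard Winograd F(2x2, 3x3) transform (matrices from Lavin and Gray's "Fast Algorithms for Convolutional Neural Networks"; illustrative, single tile, single channel). Sixteen element-wise multiplications produce the same 2x2 output that direct convolution computes with 36.
import numpy as np
Bt = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
At = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)
def winograd_2x2_3x3(d, g):
    U = G @ g @ G.T                 # transformed 3x3 filter -> 4x4
    V = Bt @ d @ Bt.T               # transformed 4x4 input tile -> 4x4
    M = U * V                       # 16 element-wise multiplications
    return At @ M @ At.T            # inverse transform -> 2x2 output
def direct_2x2_3x3(d, g):
    return np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)] for i in range(2)])
d = np.random.randn(4, 4); g = np.random.randn(3, 3)
print(np.allclose(winograd_2x2_3x3(d, g), direct_2x2_3x3(d, g)))  # True
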
00:27:58.306 --> 00:27:59.807
And here is a speedup.

00:27:59.807 --> 00:28:03.974
2.25x, so theoretically a
2.25x speedup, and in reality,

00:28:07.694 --> 00:28:10.611
from cuDNN 5 they incorporated such

00:28:11.570 --> 00:28:14.916
Winograd Convolution algorithm.

00:28:14.916 --> 00:28:19.234
This is on the VGG net I
believe, the speedup is

00:28:19.234 --> 00:28:21.401
roughly 1.7 to 2x speedup.

00:28:23.735 --> 00:28:25.318
Pretty significant.

00:28:27.314 --> 00:28:31.147
And after cuDNN 5, the
cuDNN begins to use the

00:28:33.564 --> 00:28:36.147
Winograd Convolution algorithm.

00:28:38.586 --> 00:28:43.354
Okay, so far we have covered
those efficient algorithms

00:28:43.354 --> 00:28:45.666
for efficient inference.

00:28:45.666 --> 00:28:48.978
We covered pruning, weight
sharing, quantization,

00:28:48.978 --> 00:28:52.061
and also Winograd binary and ternary.

00:28:53.436 --> 00:28:57.603
So now let's see what is the
optimal hardware for those

00:28:59.196 --> 00:29:00.805
efficient inference?

00:29:00.805 --> 00:29:02.888
And what is a Google TPU?

00:29:05.018 --> 00:29:08.685
So there are a wide
range of domain specific

00:29:09.567 --> 00:29:14.286
architectures or ASICS
for deep neural networks.

00:29:14.286 --> 00:29:16.745
They have a common goal, which
is to minimize the memory

00:29:16.745 --> 00:29:18.495
access to save power.

00:29:20.595 --> 00:29:24.708
For example the Eyeriss from
MIT by using the RS Dataflow

00:29:24.708 --> 00:29:28.028
to minimize the off chip direct access.

00:29:28.028 --> 00:29:30.818
And DaDianNao from the Chinese
Academy of Sciences

00:29:30.818 --> 00:29:33.906
buffered all the weights in on-chip
eDRAM instead of having

00:29:33.906 --> 00:29:35.823
to go to off-chip DRAM.

00:29:37.287 --> 00:29:42.108
So the TPU from Google is
using eight bit integer

00:29:42.108 --> 00:29:44.258
to represent the numbers.

00:29:44.258 --> 00:29:47.319
And at Stanford I proposed
the EIE architecture

00:29:47.319 --> 00:29:49.496
that support those compressed and

00:29:49.496 --> 00:29:53.067
sparse deep neural network inference.

00:29:53.067 --> 00:29:56.668
So this is what the TPU looks like.

00:29:56.668 --> 00:30:00.835
It can actually, smartly,
be put into a disk drive slot,

00:30:03.267 --> 00:30:06.267
up to four cards per server.

00:30:06.267 --> 00:30:09.039
And this is the high-level architecture

00:30:09.039 --> 00:30:10.622
for the Google TPU.

00:30:12.386 --> 00:30:17.239
Don't be overwhelmed;
actually, the core part

00:30:17.239 --> 00:30:21.156
here, is this giant matrix
multiplication unit.

00:30:23.218 --> 00:30:27.218
So it's a 256 by 256
matrix multiplication unit.

00:30:28.698 --> 00:30:32.531
So in one single cycle,
it can perform 64K of

00:30:37.177 --> 00:30:41.028
those multiply
and accumulate operations.

00:30:41.028 --> 00:30:44.861
So running at 700 megahertz,
the throughput is 92

00:30:47.708 --> 00:30:49.208
Teraops per second

00:30:52.380 --> 00:30:55.319
because it's actually integer operation.

00:30:55.319 --> 00:30:59.486
So that's about 25x a GPU
and more than 100x a CPU.

00:31:01.799 --> 00:31:05.966
And notice, TPU has a really
large software-managed

00:31:07.711 --> 00:31:09.541
on-chip buffer.

00:31:09.541 --> 00:31:11.124
It is 24 megabytes.

00:31:13.550 --> 00:31:18.375
For comparison, the CPU's
L3 cache is already

00:31:18.375 --> 00:31:19.720
16 megabytes.

00:31:19.720 --> 00:31:24.093
This is 24 megabytes
which is pretty large.

00:31:24.093 --> 00:31:28.453
And it's powered by
two DDR3 DRAM channels.

00:31:28.453 --> 00:31:32.536
So this is a little weak
because the bandwidth is

00:31:33.783 --> 00:31:37.950
only 30 gigabytes per second
compared with the most

00:31:39.151 --> 00:31:42.984
recent GPUs with HBM at 900
gigabytes per second.

00:31:47.543 --> 00:31:51.751
DDR4 was released in 2014,
so that makes sense, because

00:31:51.751 --> 00:31:55.493
the design, which was a little
before that, used DDR3.

00:31:55.493 --> 00:32:00.391
But if you're using DDR4 or
even high-bandwidth memory,

00:32:00.391 --> 00:32:03.391
the performance can be boosted even further.

00:32:05.011 --> 00:32:08.303
So this is a comparison
about Google's TPU compared

00:32:08.303 --> 00:32:12.470
with the CPU, GPU of this K80
GPU by the way, and the TPU.

00:32:15.800 --> 00:32:19.743
So the area is pretty much
smaller, like half the size of a

00:32:19.743 --> 00:32:23.910
CPU and GPU and the power
consumption is roughly 75 watts.

00:32:28.562 --> 00:32:32.562
And see this number, the
peak teraops per second

00:32:33.482 --> 00:32:38.103
is much higher than the
CPU and GPU is, about 90

00:32:38.103 --> 00:32:41.520
teraops per second, which is pretty high.

00:32:42.602 --> 00:32:44.922
So here is a workload.

00:32:44.922 --> 00:32:47.983
Thanks to David sharing the slide.

00:32:47.983 --> 00:32:51.060
This is the workload at Google.

00:32:51.060 --> 00:32:54.380
They did a benchmark on these TPUs.

00:32:54.380 --> 00:32:58.804
So it's a little interesting
that convolution neural nets

00:32:58.804 --> 00:33:03.711
only account for 5% of
data-center workload.

00:33:03.711 --> 00:33:06.860
Most of it is multilayer perceptrons,

00:33:06.860 --> 00:33:08.329
those fully connected layers.

00:33:08.329 --> 00:33:12.569
About 61% maybe for ads, I'm not sure.

00:33:12.569 --> 00:33:17.058
And about 29% of the workload
in data-center is the

00:33:17.058 --> 00:33:18.369
Long Short Term Memory.

00:33:18.369 --> 00:33:20.391
For example, speech recognition,

00:33:20.391 --> 00:33:23.224
or machine translation, I suspect.

00:33:28.475 --> 00:33:31.129
Remember just now we have seen there are

00:33:31.129 --> 00:33:33.569
90 teraops per second.

00:33:33.569 --> 00:33:37.671
But what number of
teraops per second can

00:33:37.671 --> 00:33:39.239
actually be achieved?

00:33:39.239 --> 00:33:43.449
This is a basic tool to
measure the bottleneck

00:33:43.449 --> 00:33:45.688
of a computer system.

00:33:45.688 --> 00:33:49.647
Whether you are bottlenecked
by the arithmetic or

00:33:49.647 --> 00:33:53.267
you are bottlenecked by
the memory bandwidth.

00:33:53.267 --> 00:33:54.817
It's like if you have a bucket,

00:33:54.817 --> 00:33:58.548
the lowest part of the
bucket determines how much

00:33:58.548 --> 00:34:01.087
water we can hold in the bucket.

00:34:01.087 --> 00:34:04.337
So in this region, you are bottlenecked

00:34:05.927 --> 00:34:07.977
by the memory bandwidth.

00:34:07.977 --> 00:34:11.477
So the x-axis is the arithmetic intensity.

00:34:13.945 --> 00:34:18.112
which is the number of floating
point operations per byte,

00:34:19.745 --> 00:34:22.415
the ratio between the
computation and the memory

00:34:22.415 --> 00:34:24.248
bandwidth demand.

00:34:26.047 --> 00:34:30.214
So the y-axis, is the actual
attainable performance.

00:34:32.967 --> 00:34:36.664
Here is the peak performance for example.

00:34:36.664 --> 00:34:40.116
When you do a lot of operation
after you fetch a single

00:34:40.116 --> 00:34:42.574
piece of data, if you
can do a lot of operation

00:34:42.574 --> 00:34:46.995
on top of it, then you are
bottlenecked by the arithmetic.

00:34:46.996 --> 00:34:51.714
But after you fetch a lot
of data from the memory,

00:34:51.714 --> 00:34:55.916
but you just do a tiny
little bit of arithmetic,

00:34:55.916 --> 00:35:00.054
then you will be bottlenecked
by the memory bandwidth.

00:35:00.054 --> 00:35:04.704
So how much you can fetch
from the memory determines

00:35:04.704 --> 00:35:08.214
how much real performance you can get.

00:35:08.214 --> 00:35:10.065
And remember there is a ratio.

00:35:10.065 --> 00:35:15.047
When the ratio is one, here in this
region, the attainable performance

00:35:15.047 --> 00:35:17.854
happens to be the same as the actual

00:35:17.854 --> 00:35:20.521
memory bandwidth of your system.

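NOTE
[Editor's note] A tiny sketch (illustrative only; peak and bandwidth numbers taken from the figures quoted in this lecture) of the roofline just described: attainable performance is the minimum of the compute peak and memory bandwidth times arithmetic intensity (ops per byte).
def roofline(intensity_ops_per_byte, peak_ops=92e12, bandwidth_bytes=30e9):
    return min(peak_ops, bandwidth_bytes * intensity_ops_per_byte)
# A layer with little reuse is memory bound; a convolution with high reuse hits the peak.
print(roofline(10))     # 3.0e11 ops/s, limited by bandwidth
print(roofline(5000))   # 9.2e13 ops/s, limited by the compute peak
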
00:35:21.476 --> 00:35:24.407
So let's see what it is like for the TPU.

00:35:24.407 --> 00:35:26.825
The TPU's peak performance is really high,

00:35:26.825 --> 00:35:28.908
about 90 Tops per second.

00:35:30.623 --> 00:35:34.790
For those convolution nets,
they are pretty much saturating

00:35:39.915 --> 00:35:41.825
the peak performance.

00:35:41.825 --> 00:35:45.644
But there are a lot of neural
networks that have a utilization

00:35:45.644 --> 00:35:47.227
less than 10%,

00:35:49.905 --> 00:35:53.572
meaning that the 90 teraops
per second actually

00:35:54.985 --> 00:35:59.152
achieves about three to 12
teraops per second in the real case.

00:36:03.244 --> 00:36:05.185
But why is it like that?

00:36:05.185 --> 00:36:09.352
The reason is, in order to
have those real-time guarantees

00:36:10.882 --> 00:36:14.691
that the user does not wait
too long, you cannot batch

00:36:14.691 --> 00:36:18.002
a lot of user's images
or speech voice data

00:36:18.002 --> 00:36:19.354
at the same time.

00:36:19.354 --> 00:36:22.811
So as a result, for those
fully connect layers,

00:36:22.811 --> 00:36:26.978
they have very little reuse,
so they are bottlenecked

00:36:28.634 --> 00:36:31.453
by the memory bandwidth.

00:36:31.453 --> 00:36:35.584
For those convolution neural
nets, for example this one,

00:36:35.584 --> 00:36:39.417
this blue one, that
achieve 86, which is CNN0.

00:36:42.333 --> 00:36:44.750
The ratio between the ops and

00:36:48.632 --> 00:36:51.872
the number of memory is the highest.

00:36:51.872 --> 00:36:56.039
It's pretty high, more than
2,000 compared with other

00:36:57.722 --> 00:37:00.722
multilayer perceptron or
long short term memory

00:37:00.722 --> 00:37:02.722
the ratio is pretty low.

00:37:04.389 --> 00:37:08.556
So this figure compares, this
is the TPU and this one is

00:37:09.682 --> 00:37:11.765
the CPU, this is the GPU.

00:37:13.021 --> 00:37:16.352
Here is memory bandwidth,
the peak memory bandwidth

00:37:16.352 --> 00:37:17.792
at a ratio of one here.

00:37:17.792 --> 00:37:20.538
So TPU has the highest memory bandwidth.

00:37:20.538 --> 00:37:24.402
And here is where are
these neural networks

00:37:24.402 --> 00:37:26.072
lie on this curve.

00:37:26.072 --> 00:37:28.538
So the asterisk is for the TPU.

00:37:28.538 --> 00:37:31.371
It's still higher than other dots,

00:37:32.890 --> 00:37:37.057
but if you're not comfortable
with this log scale figure,

00:37:38.232 --> 00:37:42.399
this is what it's like
putting it in linear roofline.

00:37:43.781 --> 00:37:46.819
So pretty much everything
disappeared except

00:37:46.819 --> 00:37:48.486
for the TPU results.

00:37:51.562 --> 00:37:54.381
So still, all these lines,
although they are higher

00:37:54.381 --> 00:37:57.282
than the CPU and GPU,
it's still way below the

00:37:57.282 --> 00:38:00.532
theoretical peak operations per second.

00:38:06.031 --> 00:38:08.802
So as I mentioned before,
it is really bottlenecked

00:38:08.802 --> 00:38:11.780
by the low latency requirement,
so that it cannot have

00:38:11.780 --> 00:38:13.402
a large batch size.

00:38:13.402 --> 00:38:16.762
That's why you have low
operations per byte.

00:38:16.762 --> 00:38:18.610
And how do you solve this problem?

00:38:18.610 --> 00:38:21.250
You want to have a smaller
memory footprint

00:38:21.250 --> 00:38:25.417
so that it can reduce the
memory bandwidth requirement.

00:38:27.219 --> 00:38:30.449
One solution is to compress
the model and the challenge

00:38:30.449 --> 00:38:35.179
is how do we build a hardware
that can do inference

00:38:35.179 --> 00:38:38.387
directly on the compressed model?

00:38:38.387 --> 00:38:42.238
So I'm going to introduce my
design of EIE, the Efficient

00:38:42.238 --> 00:38:46.347
Inference Engine, which
deals with those sparse

00:38:46.347 --> 00:38:49.755
and the compressed model to
save the memory bandwidth.

00:38:49.755 --> 00:38:52.124
And the rule of thumb, like
we mentioned before, is to take

00:38:52.124 --> 00:38:53.995
advantage of sparsity first.

00:38:53.995 --> 00:38:56.366
Anything times zero is zero.

00:38:56.366 --> 00:38:59.697
So don't store it, don't compute on it.

00:38:59.697 --> 00:39:04.286
And second idea is, you don't
need that much full precision,

00:39:04.286 --> 00:39:06.857
but you can approximate it.

00:39:06.857 --> 00:39:10.279
So by taking advantage
of the sparse weight, we

00:39:10.279 --> 00:39:15.097
get about a 10x saving in
the computation, 5x less

00:39:15.097 --> 00:39:16.345
memory footprint.

00:39:16.345 --> 00:39:19.645
The 2x difference is
due to index overhead.

00:39:19.645 --> 00:39:22.555
And by taking advantage
of the sparse activation,

00:39:22.555 --> 00:39:26.633
meaning after ReLU,
if an activation is zero, then

00:39:26.633 --> 00:39:27.795
ignore it.

00:39:27.795 --> 00:39:30.712
You save another 3x of computation.

00:39:32.454 --> 00:39:35.465
And then by such weight sharing mechanism,

00:39:35.465 --> 00:39:39.382
you can use four bits to
represent each weight rather

00:39:39.382 --> 00:39:41.144
than 32 bit.

00:39:41.144 --> 00:39:45.311
That's another eight times
saving in the memory footprint.

00:39:48.195 --> 00:39:51.894
So this is, logically,
how the weights are stored.

00:39:51.894 --> 00:39:56.214
A four by eight matrix,
and this is how physically

00:39:56.214 --> 00:39:57.475
they are stored.

00:39:57.475 --> 00:40:00.558
Only the non-zero weights are stored.

00:40:02.294 --> 00:40:04.995
So you don't need to store those zeroes.

00:40:04.995 --> 00:40:07.675
You'll save the bandwidth
fetching those zeroes.

00:40:07.675 --> 00:40:12.334
And also I'm using the
relative index to further save

00:40:12.334 --> 00:40:14.834
the number of memory overhead.

00:40:21.254 --> 00:40:25.634
So in the computation
like this figure shows,

00:40:25.634 --> 00:40:29.801
we are running the
multiplication only on non-zero.

00:40:31.283 --> 00:40:33.533
If it's zero, then skip it.

00:40:34.585 --> 00:40:38.002
Only broadcast it to the non-zero weights

00:40:39.123 --> 00:40:42.131
and if it is zero, skip it.

00:40:42.131 --> 00:40:45.883
If it's a non-zero, do the multiplication.

00:40:45.883 --> 00:40:48.499
In another cycle, do the multiplication.

00:40:48.499 --> 00:40:52.666
So the idea is anything
multiplied by zero is zero.

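NOTE
[Editor's note] A simplified Python sketch (dictionary-of-columns storage, not the real EIE datapath) of the computation just described: broadcast each non-zero activation and multiply it only against the non-zero weights in its column, skipping everything that is zero.
import numpy as np
W = np.array([[0, 2, 0, 0], [1, 0, 0, 3], [0, 0, 0, 0], [0, 4, 5, 0]], float)
a = np.array([0.0, 1.5, 0.0, 2.0])            # activations (zeros after ReLU are skipped)
cols = {j: [(i, W[i, j]) for i in range(W.shape[0]) if W[i, j] != 0] for j in range(W.shape[1])}
out = np.zeros(W.shape[0])
for j, act in enumerate(a):
    if act == 0:                               # skip zero activations entirely
        continue
    for i, w in cols[j]:                       # only stored (non-zero) weights
        out[i] += w * act
print(np.allclose(out, W @ a))                 # True
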
00:40:54.142 --> 00:40:55.820
So this is a little complicated,

00:40:55.820 --> 00:40:58.283
I'm going to go very quickly.

00:40:58.283 --> 00:41:01.428
I'm going to have a lookup
table that decodes the four bit

00:41:01.428 --> 00:41:04.923
weight into the 16 bit
weight and using the four bit

00:41:04.923 --> 00:41:08.083
relative index is passed
through an address accumulator

00:41:08.083 --> 00:41:11.411
to get the 16 bit absolute index.

00:41:11.411 --> 00:41:13.393
And this is what the hardware architecture
looks like at a high level.
00:41:13.393 --> 00:41:15.323
like in the high level.

00:41:15.323 --> 00:41:19.723
You can feel free to refer
to my paper for detail.

00:41:19.723 --> 00:41:21.523
Okay speedup.

00:41:21.523 --> 00:41:24.203
So using such efficient
hardware architecture

00:41:24.203 --> 00:41:28.203
and also model compression,
this is the original

00:41:29.713 --> 00:41:32.553
result we have seen for
CPU, GPU, mobile GPU.

00:41:32.553 --> 00:41:34.592
Now EIE is here.

00:41:34.592 --> 00:41:38.759
189 times faster than the
CPU and about 13 times faster

00:41:39.833 --> 00:41:40.916
than the GPU.

00:41:43.302 --> 00:41:46.941
So this is the energy
efficiency on the log scale,

00:41:46.941 --> 00:41:50.763
it's about 24,000x more
energy efficient than a CPU

00:41:50.763 --> 00:41:55.043
and about 3000x more energy
efficient than a GPU.

00:41:55.043 --> 00:41:58.318
It means for example,
previously if your battery can

00:41:58.318 --> 00:42:00.934
last for one hour, now it can last for

00:42:00.934 --> 00:42:02.851
3000 hours for example.

00:42:06.174 --> 00:42:09.952
So you might say ASICs are always
better than CPUs and GPUs

00:42:09.952 --> 00:42:12.294
because it's customized hardware.

00:42:12.294 --> 00:42:16.442
So this is comparing EIE with
its peer ASICs, for example

00:42:16.442 --> 00:42:18.775
DaDianNao and the TrueNorth.

00:42:20.803 --> 00:42:25.305
It has a better throughput,
better energy efficiency

00:42:25.305 --> 00:42:28.825
by order of magnitude,
compared with other ASICs.

00:42:28.825 --> 00:42:31.992
Not to mention that CPU, GPU and FPGAs.

00:42:33.134 --> 00:42:36.384
So we have covered half of the journey.

00:42:37.534 --> 00:42:39.812
We mentioned inference, we pretty much

00:42:39.812 --> 00:42:41.723
covered everything for inference.

00:42:41.723 --> 00:42:44.625
Now we are going to switch
gear and talk about training.

00:42:44.625 --> 00:42:47.011
How do we train neural
networks efficiently,

00:42:47.011 --> 00:42:48.931
how do we train it faster?

00:42:48.931 --> 00:42:51.811
So again, we are starting
with algorithm first,

00:42:51.811 --> 00:42:55.262
efficient algorithms
followed by the hardware

00:42:55.262 --> 00:42:57.179
for efficient training.

00:43:00.479 --> 00:43:03.161
So for efficient training
algorithms, I'm going to mention

00:43:03.161 --> 00:43:04.198
four topics.

00:43:04.198 --> 00:43:07.959
The first one is parallelization,
and then mixed precision

00:43:07.959 --> 00:43:12.131
training, which was just
released about one month ago

00:43:12.131 --> 00:43:15.768
and at NVIDIA GTC,
so it's fresh knowledge.

00:43:15.768 --> 00:43:18.971
And then model distillation,
followed by my work on

00:43:18.971 --> 00:43:20.961
Dense-Sparse-Dense training,
or better Regularization

00:43:20.961 --> 00:43:21.794
technique.

00:43:22.681 --> 00:43:26.121
So let's start with parallelization.

00:43:26.121 --> 00:43:29.542
So this figure shows, anyone in the hardware community.

00:43:29.542 --> 00:43:31.229
Most are very familiar with this figure.

00:43:31.229 --> 00:43:35.038
So as time goes by, what is the trend?

00:43:35.038 --> 00:43:38.422
The number of transistors
keeps increasing.

00:43:38.422 --> 00:43:43.030
But the single-threaded
performance has plateaued

00:43:43.030 --> 00:43:44.371
in recent years.

00:43:44.371 --> 00:43:48.161
And also the frequency has
plateaued in recent years.

00:43:48.161 --> 00:43:52.350
Because of the power
constraint, it has stopped scaling.

00:43:52.350 --> 00:43:56.517
And the interesting thing is that the
number of cores is increasing.

00:43:57.757 --> 00:44:00.198
So what we really need
to do is parallelization.

00:44:00.198 --> 00:44:03.427
How do we parallelize the
problem to take advantage

00:44:03.427 --> 00:44:05.827
of parallel processing?

00:44:05.827 --> 00:44:10.804
Actually there are a lot of
opportunities for parallelism

00:44:10.804 --> 00:44:12.756
in deep neural networks.

00:44:12.756 --> 00:44:15.572
For example, we can do data parallel.

00:44:15.572 --> 00:44:20.332
For example, feeding two
images into the same model

00:44:20.332 --> 00:44:23.026
and running them at the same time.

00:44:23.026 --> 00:44:26.156
This doesn't affect
latency for a single input.

00:44:26.156 --> 00:44:30.786
It doesn't make it shorter,
but it makes batch size larger

00:44:30.786 --> 00:44:35.084
basically, if you have four
machines, your effective batch

00:44:35.084 --> 00:44:38.626
size becomes four times as large as before.

00:44:38.626 --> 00:44:42.684
So it requires the
coordinated weight update.

00:44:42.684 --> 00:44:46.101
For example, this is a paper from Google.

00:44:46.973 --> 00:44:51.140
There is a parameter server
as the master and a couple of

00:44:52.564 --> 00:44:56.731
workers, each running on its own piece
of training data; they send

00:44:59.032 --> 00:45:03.154
their gradients to the parameter
server and get back the updated

00:45:03.154 --> 00:45:05.571
weights individually.

00:45:07.312 --> 00:45:11.063
That's how data parallelism is handled.
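
To make the flow concrete, here is a minimal Python sketch of parameter-server style data parallelism, with hypothetical names (worker_gradient, parameter_server_step) and a toy least-squares model standing in for the real network:

```python
import numpy as np

def worker_gradient(weights, x_shard, y_shard):
    # toy model: gradient of 0.5 * ||x @ w - y||^2 on this worker's shard
    return x_shard.T @ (x_shard @ weights - y_shard) / len(x_shard)

def parameter_server_step(weights, shards, lr=0.1):
    # in practice each gradient is computed on its own machine in parallel
    grads = [worker_gradient(weights, x, y) for x, y in shards]
    avg_grad = np.mean(grads, axis=0)        # coordinated weight update
    return weights - lr * avg_grad           # new weights broadcast back

rng = np.random.default_rng(0)
x, y = rng.normal(size=(64, 8)), rng.normal(size=64)
shards = [(x[i::4], y[i::4]) for i in range(4)]  # 4 workers -> 4x effective batch
w = np.zeros(8)
for _ in range(100):
    w = parameter_server_step(w, shards)
```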

00:45:11.063 --> 00:45:14.604
Another idea is that there could
be model parallelism.

00:45:14.604 --> 00:45:17.524
You can split your model and hand it

00:45:17.524 --> 00:45:21.383
to different processors
or different threads.

00:45:21.383 --> 00:45:25.543
For example, there's this image,
you want to run convolution

00:45:25.543 --> 00:45:29.293
on this image; that is a
six-dimensional for loop.

00:45:30.530 --> 00:45:35.271
What you can do is you
can cut the input image into

00:45:35.271 --> 00:45:39.482
two by two blocks so that
each thread, or each processor

00:45:39.482 --> 00:45:42.619
handles one fourth of the image.

00:45:42.619 --> 00:45:45.580
Although there's a small
halo here in between that you

00:45:45.580 --> 00:45:47.330
have to take care of.
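
Here is a small sketch of that spatial tiling, assuming SciPy's convolve2d as the reference convolution: each quadrant is convolved with a one-pixel halo of neighboring data so the stitched result matches the monolithic convolution.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_tile(image, kernel, tile_row, tile_col, halo):
    """Convolve one quadrant of the image, including a halo of neighboring
    pixels so tile borders match the full-image convolution."""
    h, w = image.shape
    r0, r1 = tile_row * h // 2, (tile_row + 1) * h // 2
    c0, c1 = tile_col * w // 2, (tile_col + 1) * w // 2
    padded = np.pad(image, halo)                      # zero padding at the outer edge
    patch = padded[r0:r1 + 2 * halo, c0:c1 + 2 * halo]
    return convolve2d(patch, kernel, mode='valid')

image = np.random.rand(8, 8)
kernel = np.ones((3, 3)) / 9.0
halo = 1                                              # (kernel_size - 1) // 2
tiles = [[conv_tile(image, kernel, r, c, halo) for c in range(2)] for r in range(2)]
full = np.block(tiles)                                # stitch the 2x2 tiles together
reference = convolve2d(np.pad(image, halo), kernel, mode='valid')
assert np.allclose(full, reference)
```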

00:45:48.260 --> 00:45:50.860
And also, you can parallelize by the

00:45:50.860 --> 00:45:53.193
output or input feature map.

00:45:54.730 --> 00:45:56.911
And for those fully connected layers,

00:45:56.911 --> 00:45:58.500
how do we parallelize the model?

00:45:58.500 --> 00:45:59.442
It's even simpler.

00:45:59.442 --> 00:46:02.420
You can cut the model in half

00:46:02.420 --> 00:46:05.337
and hand it to different threads.
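
A toy sketch of that idea: a fully connected layer y = x @ W split column-wise, with each half computed by a different device or thread and the outputs concatenated. The names are illustrative, not from any particular framework.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 256))               # batch of activations
W = rng.normal(size=(256, 512))              # full weight matrix

W_left, W_right = W[:, :256], W[:, 256:]     # cut the model in half
y_left = x @ W_left                          # device 0
y_right = x @ W_right                        # device 1
y = np.concatenate([y_left, y_right], axis=1)

assert np.allclose(y, x @ W)                 # same result as the unsplit layer
```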

00:46:06.551 --> 00:46:07.991
And the third idea, you can even do

00:46:07.991 --> 00:46:09.378
hyper-parameter parallel.

00:46:09.378 --> 00:46:11.762
For example, you can tune
your learning rate, your

00:46:11.762 --> 00:46:14.402
weight decay on different machines;

00:46:14.402 --> 00:46:16.400
that's coarse-grained parallelism.

00:46:16.400 --> 00:46:20.780
So there are so many
alternatives you have to tune.
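
As a hedged sketch of that coarse-grained hyper-parameter parallelism: each (learning rate, weight decay) combination is an independent training job that could be dispatched to its own machine; submit and train_model below are hypothetical placeholders.

```python
from itertools import product

learning_rates = [1e-1, 1e-2, 1e-3]
weight_decays = [1e-4, 5e-4]

jobs = list(product(learning_rates, weight_decays))
for machine_id, (lr, wd) in enumerate(jobs):
    print(f"machine {machine_id}: lr={lr}, weight_decay={wd}")
    # submit(train_model, lr=lr, weight_decay=wd)   # one independent job per machine
```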

00:46:20.780 --> 00:46:23.631
Small summary of the parallelism.

00:46:23.631 --> 00:46:27.031
There are lots of parallelisms
in deep neural networks.

00:46:27.031 --> 00:46:30.271
For example, with data
parallelism, you can run multiple

00:46:30.271 --> 00:46:34.820
training images, but you
cannot have an unlimited number

00:46:34.820 --> 00:46:38.940
of processors because you
are limited by batch size.

00:46:38.940 --> 00:46:42.068
If it's too large, stochastic gradient descent

00:46:42.068 --> 00:46:44.438
becomes gradient descent, that's not good.

00:46:44.438 --> 00:46:47.277
You can also run the model parallelism.

00:46:47.277 --> 00:46:50.466
Split the model, either
by cutting the image or

00:46:50.466 --> 00:46:53.133
cutting the convolution weights.

00:46:58.598 --> 00:47:01.223
Either cutting the image or cutting

00:47:01.223 --> 00:47:03.940
the fully connected layers.

00:47:03.940 --> 00:47:08.319
So it's very easy to get 16
to 64 GPUs training one model

00:47:08.319 --> 00:47:10.490
in parallel, having very good speedup.

00:47:10.490 --> 00:47:12.323
Almost linear speedup.

00:47:13.810 --> 00:47:17.988
Okay, next interesting
thing, mixed precision with

00:47:17.988 --> 00:47:19.071
FP16 and FP32.

00:47:21.319 --> 00:47:23.370
So remember in the
beginning of this lecture,

00:47:23.370 --> 00:47:28.207
I had a chart showing the
energy and area overhead for

00:47:28.207 --> 00:47:30.290
a 16 bit versus a 32 bit.

00:47:31.887 --> 00:47:36.054
Going from 32 bit to 16 bit,
you save about 4x the energy

00:47:37.890 --> 00:47:39.223
and 4x the area.

00:47:40.528 --> 00:47:43.340
So can we train a deep
neural network with such low

00:47:43.340 --> 00:47:47.831
precision with floating point
16 bit rather than 32 bit?

00:47:47.831 --> 00:47:50.998
It turns out we can do that partially.

00:47:53.498 --> 00:47:58.250
By partially, I mean we
need FP32 in some places.

00:47:58.250 --> 00:48:01.090
And where are those places?

00:48:01.090 --> 00:48:05.257
So we can do the multiplication
with 16-bit inputs.

00:48:07.951 --> 00:48:11.476
And then we have to do the summation

00:48:11.476 --> 00:48:13.879
with 32-bit accumulation.

00:48:13.879 --> 00:48:18.860
And then convert the result
to 32 bit to store the weight.

00:48:18.860 --> 00:48:22.777
So that's where the mixed
precision comes from.

00:48:25.108 --> 00:48:28.140
So for example, we have
a master weight stored in

00:48:28.140 --> 00:48:31.932
floating point 32; we down-convert
it to floating

00:48:31.932 --> 00:48:36.099
point 16 and then we do the
feed forward with 16 bit

00:48:37.612 --> 00:48:42.290
weights and 16-bit activations,
and we get a 16-bit activation

00:48:42.290 --> 00:48:46.522
here at the end. When we
are doing back propagation,

00:48:46.522 --> 00:48:50.689
the computation is also done
with floating point 16 bit.

00:48:52.700 --> 00:48:57.351
Very interestingly, for
the weights we get a floating

00:48:57.351 --> 00:49:00.851
point 16 bit gradient here for the weight.

00:49:03.255 --> 00:49:07.422
But when we are doing the
update, so W minus the learning

00:49:09.598 --> 00:49:13.154
rate times the gradient,
that operation has

00:49:13.154 --> 00:49:14.904
to be done in 32 bit.

00:49:17.740 --> 00:49:20.943
That's where the mixed
precision is coming from.

00:49:20.943 --> 00:49:24.692
And you can see there are two
colors: this here is 16 bit,

00:49:24.692 --> 00:49:26.514
and this here is 32 bit.

00:49:26.514 --> 00:49:30.181
That's where the mixed
precision comes from.
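
A minimal numerical sketch of that flow (a toy linear layer, not NVIDIA's exact recipe): the master weights live in FP32, the forward and backward math runs in FP16, and the weight update is applied in FP32 against the master copy.

```python
import numpy as np

rng = np.random.default_rng(0)
master_w = rng.normal(size=(4, 4)).astype(np.float32)   # FP32 master weights
x = rng.normal(size=(8, 4)).astype(np.float16)
y = rng.normal(size=(8, 4)).astype(np.float16)
lr = np.float32(0.01)

for step in range(10):
    w16 = master_w.astype(np.float16)        # down-convert weights to FP16
    out = x @ w16                            # forward pass in FP16
    grad_out = out - y                       # FP16 error signal
    grad_w16 = x.T @ grad_out                # backward pass in FP16
    # the update W = W - lr * grad is done in FP32 on the master weights
    master_w -= lr * grad_w16.astype(np.float32)
```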

00:49:31.284 --> 00:49:36.212
So does such low precision
sacrifice your prediction

00:49:36.212 --> 00:49:38.884
accuracy for your model?

00:49:38.884 --> 00:49:43.051
So this is the figure from
NVIDIA just released a couple

00:49:43.914 --> 00:49:45.747
of weeks ago actually.

00:49:46.652 --> 00:49:49.819
Thanks to Paulius for giving me the slide.

00:49:51.431 --> 00:49:55.751
The convergence between
floating point 32 versus

00:49:55.751 --> 00:49:58.500
the Volta tensor op, which
is basically the mixed

00:49:58.500 --> 00:50:00.842
precision training, is
actually pretty much

00:50:00.842 --> 00:50:02.932
the same for convergence.

00:50:02.932 --> 00:50:04.762
If you zoom it in a little bit,

00:50:04.762 --> 00:50:06.690
they are pretty much the same.

00:50:06.690 --> 00:50:11.052
And for ResNet, the mixed
precision sometimes behaves

00:50:11.052 --> 00:50:14.771
a little better than the
full precision weight.

00:50:14.771 --> 00:50:17.234
Maybe because of noise.

00:50:17.234 --> 00:50:20.582
But in the end, after you
train the model, this is

00:50:20.582 --> 00:50:24.762
the result of AlexNet,
Inception V3, and ResNet-50

00:50:24.762 --> 00:50:28.679
with FP32 versus FP16
mixed precision training.

00:50:29.881 --> 00:50:32.721
The accuracy is pretty much the same

00:50:32.721 --> 00:50:33.962
for these two methods.

00:50:33.962 --> 00:50:37.295
A little bit worse, but not by too much.

00:50:40.042 --> 00:50:43.714
So having talked about the
mixed precision training,

00:50:43.714 --> 00:50:47.881
the next idea is to train
with model distillation.

00:50:49.703 --> 00:50:52.412
For example, you can have
multiple neural networks,

00:50:52.412 --> 00:50:55.863
GoogLeNet, VGGNet, ResNet, for example.

00:50:55.863 --> 00:51:00.030
And the question is, can
we take advantage of these

00:51:00.943 --> 00:51:02.092
different models?

00:51:02.092 --> 00:51:05.132
Of course we can do a model
ensemble, but can we utilize them

00:51:05.132 --> 00:51:09.299
as teachers, to teach a small
junior neural network to have

00:51:11.201 --> 00:51:15.434
it perform as well as the
senior neural networks?

00:51:15.434 --> 00:51:17.090
So this is the idea.

00:51:17.090 --> 00:51:21.257
You have multiple large
powerful senior neural networks

00:51:23.314 --> 00:51:25.202
to teach this student model.

00:51:25.202 --> 00:51:28.881
And hopefully it can get better results.

00:51:28.881 --> 00:51:32.372
And the idea to do that
is, instead of using this

00:51:32.372 --> 00:51:37.162
hard label, for example for
car, dog, cat, the probability

00:51:37.162 --> 00:51:41.329
for dog is 100%, but the
output of the geometric

00:51:42.383 --> 00:51:46.063
ensemble of those large
teacher neural networks

00:51:46.063 --> 00:51:50.230
might say the dog is 90%
and the cat is about 10%,

00:51:53.282 --> 00:51:55.492
and the magic happens here.

00:51:55.492 --> 00:51:59.071
You want to have a
softened result label here.

00:51:59.071 --> 00:52:03.071
For example, the dog
is 30%, the cat is 20%.

00:52:03.071 --> 00:52:05.471
Still the dog is higher than the cat.

00:52:05.471 --> 00:52:09.260
So the prediction is
still correct, but it uses

00:52:09.260 --> 00:52:13.427
this soft label to train
the student neural network

00:52:15.431 --> 00:52:19.460
rather than using this hard label to train

00:52:19.460 --> 00:52:21.991
the student neural network.

00:52:21.991 --> 00:52:26.740
And mathematically, you
control how much you soften

00:52:26.740 --> 00:52:30.482
it by this temperature
in the softmax,

00:52:30.482 --> 00:52:33.149
controlled by this temperature.
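
Here is a small sketch of that softened softmax; the logits and the temperature values are made up for illustration, but they show how a higher temperature spreads the probability mass while keeping the ranking.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Higher temperature T makes the distribution softer (closer to uniform)."""
    z = logits / T
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

teacher_logits = np.array([5.0, 2.5, 0.5])   # e.g. dog, cat, car

print(softmax_with_temperature(teacher_logits, T=1))   # hard-ish: dog ~0.91
print(softmax_with_temperature(teacher_logits, T=5))   # softened: dog still highest
# The student is trained against the softened teacher distribution (optionally
# combined with the true hard label), which carries information about how
# similar the wrong classes are to the right one.
```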

00:52:34.322 --> 00:52:36.751
And the result is that,
starting with the trained model

00:52:36.751 --> 00:52:40.918
that classifies 58.9% of
the test frames correctly,

00:52:43.099 --> 00:52:46.099
the new model converges to 57%.

00:52:47.340 --> 00:52:50.173
Trained on only 3% of the data.

00:52:52.699 --> 00:52:54.882
So that's the magic for model distillation

00:52:54.882 --> 00:52:56.715
using this soft label.

00:52:59.191 --> 00:53:02.460
And the last idea is my recent paper using

00:53:02.460 --> 00:53:06.242
a better regularization
to train deep neural nets.

00:53:06.242 --> 00:53:07.908
We have seen these two figures before.

00:53:07.908 --> 00:53:09.929
We pruned the neural
network, having a smaller number

00:53:09.929 --> 00:53:12.300
of weights, but with the same accuracy.

00:53:12.300 --> 00:53:15.439
Now what I did is to
recover and to retrain those

00:53:15.439 --> 00:53:18.271
weights shown in red
and train everything

00:53:18.271 --> 00:53:21.625
together to increase
the model capacity after

00:53:21.625 --> 00:53:24.887
it has been trained in a low-dimensional space.

00:53:24.887 --> 00:53:27.528
It's like you learn the trunk
first and then gradually

00:53:27.528 --> 00:53:31.071
add those leaves and
learn everything together.
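
A hedged sketch of Dense-Sparse-Dense on a single weight matrix; train_step is a stand-in for the usual optimizer step (the gradient here is a placeholder), and the essential idea is the pruning mask applied during the sparse phase and removed for the final dense phase.

```python
import numpy as np

def train_step(w, mask=None, lr=0.01):
    grad = np.random.randn(*w.shape)          # placeholder for the real gradient
    w = w - lr * grad
    if mask is not None:
        w = w * mask                          # keep pruned weights at zero
    return w

w = np.random.randn(256, 256)

# 1) Dense: train normally (learn the "trunk").
for _ in range(100):
    w = train_step(w)

# 2) Sparse: prune the smallest-magnitude weights and retrain under the mask.
threshold = np.percentile(np.abs(w), 70)      # e.g. prune 70% of the weights
mask = (np.abs(w) > threshold).astype(w.dtype)
w = w * mask
for _ in range(100):
    w = train_step(w, mask)

# 3) Dense again: free the pruned weights (restarting from zero) and
#    train everything together to recover full model capacity.
for _ in range(100):
    w = train_step(w)
```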

00:53:31.071 --> 00:53:35.238
It turns out that on ImageNet it
gives roughly a 1% to

00:53:37.471 --> 00:53:41.020
4% absolute improvement in accuracy.

00:53:41.020 --> 00:53:44.998
And it is also general purpose;
it works on long short-term memory

00:53:44.998 --> 00:53:49.330
and also recurrent neural
nets, in collaboration with Baidu.

00:53:49.330 --> 00:53:52.610
So I also open-sourced
these specially trained models

00:53:52.610 --> 00:53:56.460
in the DSD Model Zoo, where
all of

00:53:56.460 --> 00:54:00.490
these models are trained: GoogLeNet, VGG,
ResNet, SqueezeNet,

00:54:00.490 --> 00:54:01.969
and also AlexNet.

00:54:01.969 --> 00:54:05.099
So if you are interested,
feel free to check out this

00:54:05.099 --> 00:54:09.182
Model Zoo and compare it
with the Caffe Model Zoo.

00:54:11.010 --> 00:54:14.998
Here are some examples of how
dense-sparse-dense training helps

00:54:14.998 --> 00:54:16.581
with image captioning.

00:54:17.878 --> 00:54:21.396
For example, this is a
very challenging figure.

00:54:21.396 --> 00:54:24.087
The original NeuralTalk
baseline says a boy in

00:54:24.087 --> 00:54:27.318
a red shirt is climbing a rock wall.

00:54:27.318 --> 00:54:29.179
And the sparse model says
a young girl is jumping

00:54:29.179 --> 00:54:31.849
off a tree, probably
mistaking the hair for either

00:54:31.849 --> 00:54:33.729
the rock or the tree.

00:54:33.729 --> 00:54:36.278
But then dense-sparse-dense
training, by using this kind of

00:54:36.278 --> 00:54:39.100
regularization on a low
dimensional space, it says

00:54:39.100 --> 00:54:42.958
a young girl in a pink shirt
is swinging on a swing.

00:54:42.958 --> 00:54:47.070
And there are a lot of examples
due to the limit of time,

00:54:47.070 --> 00:54:49.129
I will not go over them one by one.

00:54:49.129 --> 00:54:51.150
For example, a group of
people are standing in front

00:54:51.150 --> 00:54:53.118
of a building, there's no building.

00:54:53.118 --> 00:54:55.630
A group of people are walking in the park.

00:54:55.630 --> 00:54:58.550
Feel free to check out the
paper and see more interesting

00:54:58.550 --> 00:54:59.383
results.

00:55:01.420 --> 00:55:05.587
Okay finally, we come to
hardware for efficient training.

00:55:06.478 --> 00:55:08.929
How do we take advantage of the algorithms

00:55:08.929 --> 00:55:10.089
we just mentioned?

00:55:10.089 --> 00:55:14.060
For example, parallelism,
mixed precision, how is

00:55:14.060 --> 00:55:16.630
the hardware designed to actually

00:55:16.630 --> 00:55:19.297
take advantage of such features?

00:55:21.958 --> 00:55:26.041
First GPUs, this is the
Nvidia PASCAL GPU, GP100,

00:55:28.950 --> 00:55:31.367
which was released last year.

00:55:32.289 --> 00:55:35.789
So it supports up to 20 Teraflops on FP16.

00:55:38.048 --> 00:55:40.849
It has 16 gigabytes of
high bandwidth memory.

00:55:40.849 --> 00:55:42.932
750 gigabytes per second.

00:55:46.060 --> 00:55:49.430
So remember, computation
and memory bandwidth are

00:55:49.430 --> 00:55:53.350
the two factors that determine
your overall performance.

00:55:53.350 --> 00:55:57.041
Whichever is lower, performance will suffer.

00:55:57.041 --> 00:56:01.124
So this is really high
bandwidth, 700 gigabytes per second,

00:56:02.209 --> 00:56:06.376
compared with DDR3, which is just 10
or 30 gigabytes per second.

00:56:08.189 --> 00:56:10.022
It consumes 300 watts, and

00:56:14.147 --> 00:56:17.278
it's done in a 16 nanometer process

00:56:17.278 --> 00:56:20.945
and has 160 gigabytes
per second NVLink.

00:56:22.248 --> 00:56:25.048
So remember we have
computation, we have memory,

00:56:25.048 --> 00:56:28.307
and the third thing is the communication.

00:56:28.307 --> 00:56:31.547
All three factors have to
be balanced in order to

00:56:31.547 --> 00:56:33.797
achieve a good performance.

00:56:35.088 --> 00:56:39.171
So this is very powerful,
but even more exciting,

00:56:40.558 --> 00:56:44.739
just about a month ago,
Jensen released the newest

00:56:44.739 --> 00:56:48.077
architecture called the Volta GPUs.

00:56:48.077 --> 00:56:50.877
And let's see what is
inside the Volta GPU.

00:56:50.877 --> 00:56:55.044
Just released less than a
month ago, it has 15

00:56:57.568 --> 00:57:01.651
FP32 teraflops, and what
is new here is that there are 120

00:57:03.950 --> 00:57:08.128
Tensor TFLOPS, specifically
designed for deep learning.

00:57:08.128 --> 00:57:11.207
And we'll later cover
what is the tensor core.

00:57:11.207 --> 00:57:13.957
And where this 120 is coming from.

00:57:16.368 --> 00:57:19.699
And rather than 750
gigabytes per second, this

00:57:19.699 --> 00:57:24.499
year, the HBM2, they are
using 900 gigabytes per second

00:57:24.499 --> 00:57:25.678
memory bandwidth.

00:57:25.678 --> 00:57:27.190
Very exciting.

00:57:27.190 --> 00:57:32.139
And the 12 nanometer process has
a die size of more than 800

00:57:32.139 --> 00:57:33.248
square millimeters.

00:57:33.248 --> 00:57:37.310
A really large chip and
supported by 300 gigabytes per

00:57:37.310 --> 00:57:38.477
second NVLink.

00:57:40.931 --> 00:57:44.880
So what's new in Volta, the
most interesting thing for us

00:57:44.880 --> 00:57:49.251
for deep learning, is this
thing called Tensor Core.

00:57:49.251 --> 00:57:51.629
So what is a Tensor Core?

00:57:51.629 --> 00:57:56.200
Tensor Core is actually
an instruction that can

00:57:56.200 --> 00:58:00.987
do the four by four matrix
times a four by four matrix.

00:58:00.987 --> 00:58:05.429
FMA stands for Fused
Multiply-Add,

00:58:05.429 --> 00:58:08.491
in this mixed precision operation.

00:58:08.491 --> 00:58:11.074
Just in one single clock cycle.

00:58:12.939 --> 00:58:15.698
So let's unpack a little
bit what this means.

00:58:15.698 --> 00:58:19.865
So mixed precision is exactly
as we mentioned in the last

00:58:20.699 --> 00:58:24.866
chapter, so we are having
FP16 for the multiplication,

00:58:26.430 --> 00:58:30.430
but for accumulation, we
are doing it with FP32.

00:58:31.928 --> 00:58:35.870
That's where the mixed
precision comes from.

00:58:35.870 --> 00:58:38.657
So let's see how many
operations: if it's four

00:58:38.657 --> 00:58:43.030
by four by four, that's 64
multiplications, just

00:58:43.030 --> 00:58:45.000
in one single cycle.
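
In plain numpy terms, what one such operation computes is roughly the following (a conceptual sketch, not how the hardware is programmed): D = A @ B + C on 4x4 tiles, with FP16 inputs and FP32 accumulation.

```python
import numpy as np

A = np.random.rand(4, 4).astype(np.float16)   # FP16 input tile
B = np.random.rand(4, 4).astype(np.float16)   # FP16 input tile
C = np.random.rand(4, 4).astype(np.float32)   # FP32 accumulator tile

# 64 multiplies plus the additions, accumulated in FP32
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)   # float32 (4, 4)
```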

00:58:45.000 --> 00:58:48.920
That's a 12x increase in
the speed of the Volta

00:58:48.920 --> 00:58:53.087
compared with the Pascal, which
was released just last year.

00:58:55.099 --> 00:58:59.590
So this is the result for
matrix multiplication on

00:58:59.590 --> 00:59:01.288
different sizes.

00:59:01.288 --> 00:59:05.455
The speedup of Volta over
Pascal is roughly 3x

00:59:08.928 --> 00:59:11.845
doing these matrix multiplications.

00:59:13.368 --> 00:59:16.790
What we care more is not
only matrix multiplication

00:59:16.790 --> 00:59:19.958
but actually running the deep neural nets.

00:59:19.958 --> 00:59:23.048
So both for training and for inference.

00:59:23.048 --> 00:59:26.630
And for training on
ResNet-50, by taking advantage

00:59:26.630 --> 00:59:29.998
of this Tensor Core in this V100,

00:59:29.998 --> 00:59:33.581
it is 2.4x faster than
the P100 using FP32.

00:59:38.887 --> 00:59:43.054
So on the right hand side,
it compares the inference

00:59:43.899 --> 00:59:48.066
speedup, given a 7 microsecond
latency requirement.

00:59:50.138 --> 00:59:53.910
What is the number of images
per second it can process?

00:59:53.910 --> 00:59:56.459
It's a measurement of throughput.

00:59:56.459 --> 01:00:00.292
Again, the V100 over
P100, by taking advantage

01:00:03.796 --> 01:00:07.796
of the Tensor Core, is
3.7x faster than the P100.

01:00:13.887 --> 01:00:18.745
So this figure gives roughly
an idea, what is a Tensor Core,

01:00:18.745 --> 01:00:22.287
what is an integer unit, what
is a floating point unit.

01:00:22.287 --> 01:00:23.954
So this whole figure

01:00:27.705 --> 01:00:28.872
is a single SM

01:00:33.065 --> 01:00:35.004
streaming multiprocessor.

01:00:35.004 --> 01:00:39.495
So SM is partitioned into
four processing blocks.

01:00:39.495 --> 01:00:41.763
One, two, three, four, right?

01:00:41.763 --> 01:00:45.846
And in each block there
are eight FP64 cores here

01:00:48.105 --> 01:00:52.105
and 16 FP32 and 16 INT32
units here.

01:00:55.751 --> 01:01:00.353
And then there are two of
the new mixed precision

01:01:00.353 --> 01:01:04.520
Tensor cores specifically
designed for deep learning.

01:01:07.641 --> 01:01:10.684
And also there is one
warp scheduler, a dispatch unit

01:01:10.684 --> 01:01:13.513
and a register file, as before.

01:01:13.513 --> 01:01:17.596
So what is new here is
the Tensor core unit here.

01:01:18.935 --> 01:01:23.102
So here is a figure comparing
the recent generations of

01:01:25.722 --> 01:01:27.639
Nvidia GPUs from Kepler

01:01:29.164 --> 01:01:31.664
to Maxwell to Pascal to Volta.

01:01:34.722 --> 01:01:37.425
We can see everything
keeps improving.

01:01:37.425 --> 01:01:40.733
For example, the boost clock
has been increased from

01:01:40.733 --> 01:01:42.816
about 800 MHz to 1.4 GHz.

01:01:46.563 --> 01:01:50.730
And starting from the Volta generation,
there are

01:01:52.855 --> 01:01:57.022
the Tensor Core units here,
which never existed before.

01:01:59.241 --> 01:02:01.158
And up to the Maxwell generation,

01:02:02.364 --> 01:02:04.781
the GPUs were using GDDR5,

01:02:07.924 --> 01:02:10.662
and starting from the Pascal GPU,

01:02:10.662 --> 01:02:12.993
HBM came into place,

01:02:12.993 --> 01:02:14.593
the high-bandwidth memory.

01:02:14.593 --> 01:02:17.093
750 gigabytes per second here.

01:02:18.543 --> 01:02:22.804
900 gigabytes per second
compared with DDR3,

01:02:22.804 --> 01:02:24.804
30 gigabytes per second.

01:02:27.364 --> 01:02:31.531
And memory size actually
didn't increase by too much,

01:02:34.204 --> 01:02:36.593
and the power consumption is actually

01:02:36.593 --> 01:02:38.783
also remains roughly the same.

01:02:38.783 --> 01:02:41.844
But given the increase in
computation, you can fit it

01:02:41.844 --> 01:02:46.712
in a fixed power envelope;
that's still an exciting thing.

01:02:46.712 --> 01:02:49.433
And the manufacturing process
is actually improving from

01:02:49.433 --> 01:02:53.600
28 nanometer, 16 nanometer,
all the way to 12 nanometer.

01:02:55.295 --> 01:02:58.033
And the chip area is also increasing, to

01:02:58.033 --> 01:03:01.616
800 square millimeters,
that's really huge.

01:03:03.084 --> 01:03:07.513
So, you may be interested
in the comparison of the GPU

01:03:07.513 --> 01:03:09.663
with the TPU, right?

01:03:09.663 --> 01:03:12.463
So how do they compare with each other?

01:03:12.463 --> 01:03:15.023
So in the original TPU paper,

01:03:15.023 --> 01:03:18.797
the TPU was actually designed
roughly in 2015,

01:03:18.797 --> 01:03:22.464
and this is a comparison
with the Pascal P40 GPU

01:03:23.673 --> 01:03:25.090
released in 2016.

01:03:27.815 --> 01:03:30.924
So for the TPU, the power consumption is lower,

01:03:30.924 --> 01:03:34.273
and it has a larger on-chip memory of 24 megabytes,

01:03:34.273 --> 01:03:38.015
a really large on-chip SRAM
managed by the software.

01:03:38.015 --> 01:03:42.593
And then both of them
support INT8 operations,

01:03:42.593 --> 01:03:46.760
while for inferences per second,
given a 10 millisecond latency limit,

01:03:47.764 --> 01:03:50.484
the comparison for the TPU is 1X.

01:03:50.484 --> 01:03:52.651
For the P40 it's about 2X.

01:03:57.975 --> 01:03:59.558
So, just last week,

01:04:01.682 --> 01:04:03.655
in the Google I/O,

01:04:03.655 --> 01:04:06.421
a new nuclear bomb landed on Earth.

01:04:06.421 --> 01:04:09.251
That is the Google Cloud TPU.

01:04:09.251 --> 01:04:13.203
So now the TPU not only supports inference,

01:04:13.203 --> 01:04:15.353
but also supports training.

01:04:15.353 --> 01:04:18.622
So there is very limited
information we can get

01:04:18.622 --> 01:04:20.873
beyond this Google Blog.

01:04:20.873 --> 01:04:24.790
So their Cloud TPU delivers
up to 180 teraflops

01:04:28.713 --> 01:04:32.130
to train and run machine learning models.

01:04:33.422 --> 01:04:36.820
And this is multiple Cloud TPU,

01:04:36.820 --> 01:04:38.903
making it into a TPU pod,

01:04:40.110 --> 01:04:44.963
which is built with 64
second generation TPUs

01:04:44.963 --> 01:04:48.542
and delivers up to 11.5 petaflops

01:04:48.542 --> 01:04:50.873
of machine learning acceleration.

01:04:50.873 --> 01:04:53.862
So in the Google Blog, they mentioned that

01:04:53.862 --> 01:04:56.420
one of the large scale translation models,

01:04:56.420 --> 01:05:00.881
Google translation models, used
to take a full day to train

01:05:00.881 --> 01:05:05.048
on 32 of the best commercially available
GPUs, probably P40

01:05:06.731 --> 01:05:07.981
or P100, maybe.

01:05:08.902 --> 01:05:11.380
And now it trains to the same accuracy,

01:05:11.380 --> 01:05:15.547
just within one afternoon,
with just 1/8 of a TPU pod,

01:05:17.523 --> 01:05:19.606
which is pretty exciting.

01:05:22.611 --> 01:05:25.273
Okay, so as a little wrap-up.

01:05:25.273 --> 01:05:27.662
We covered a lot of stuff, we've mentioned

01:05:27.662 --> 01:05:30.763
the four dimension space
of algorithm and hardware,

01:05:30.763 --> 01:05:33.993
inference and training, we
covered the algorithms for

01:05:33.993 --> 01:05:36.982
inference, for example,
pruning and quantization,

01:05:36.982 --> 01:05:40.251
Winograd Convolution, binary, ternary,

01:05:40.251 --> 01:05:42.174
weight sharing, for example.

01:05:42.174 --> 01:05:44.603
And then the hardware for
the efficient inference.

01:05:44.603 --> 01:05:46.353
For example, the TPU,

01:05:48.665 --> 01:05:52.523
which takes advantage of INT8, integer 8.

01:05:52.523 --> 01:05:56.464
And also my design of the EIE
accelerator, which takes advantage

01:05:56.464 --> 01:05:59.951
of sparsity: anything
multiplied by zero is zero,

01:05:59.951 --> 01:06:03.201
so don't store it, don't compute on it.

01:06:04.260 --> 01:06:07.131
And also the efficient algorithm
for training, for example,

01:06:07.131 --> 01:06:11.312
how we do parallelization,
and the most recent research on

01:06:11.312 --> 01:06:14.901
how to use mixed precision
training by taking advantage

01:06:14.901 --> 01:06:18.151
of FP16 rather than FP32 to do training

01:06:19.131 --> 01:06:22.131
which saves about four times the energy

01:06:22.131 --> 01:06:23.939
and four times the area,

01:06:23.939 --> 01:06:27.731
without really sacrificing
the accuracy you get from

01:06:27.731 --> 01:06:28.814
the training.

01:06:31.803 --> 01:06:35.352
And also Dense-Sparse-Dense
training, using a better regularization,

01:06:35.352 --> 01:06:39.519
sparse regularization, and also
the teacher-student model.

01:06:41.021 --> 01:06:43.741
You have multiple teacher neural
networks and a small

01:06:43.741 --> 01:06:46.461
student network, and you
can distill the knowledge

01:06:46.461 --> 01:06:51.072
from the teacher neural
networks through a temperature.

01:06:51.072 --> 01:06:54.650
And finally we covered the
hardware for efficient training

01:06:54.650 --> 01:06:57.580
and introduced two nuclear bombs.

01:06:57.580 --> 01:07:01.747
One is the Volta GPU, the
other is the TPU version two,

01:07:02.590 --> 01:07:06.507
the Cloud TPU and also
the amazing Tensor cores

01:07:09.184 --> 01:07:12.771
in the newest generation of Nvidia GPUs.

01:07:12.771 --> 01:07:16.632
And we also reviewed the
progression of a wide range of

01:07:16.632 --> 01:07:20.861
the recent Nvidia GPUs
from the Kepler K40,

01:07:20.861 --> 01:07:23.461
that's actually when
I started my research,

01:07:23.461 --> 01:07:25.283
what we used in the beginning,

01:07:25.283 --> 01:07:28.033
all the way to the M40,

01:07:29.437 --> 01:07:33.213
and then Pascal and then
finally the exciting Volta GPU.

01:07:33.213 --> 01:07:37.380
So every year there is a
nuclear bomb in the spring.

01:07:40.981 --> 01:07:44.992
Okay, a little look ahead in the future.

01:07:44.992 --> 01:07:47.381
So in the city of the future,
we can imagine there are a lot

01:07:47.381 --> 01:07:52.301
of AI applications:
smart society, smart care,

01:07:52.301 --> 01:07:56.504
IOT devices, smart retail,
for example, the Amazon Go,

01:07:56.504 --> 01:07:59.984
and also smart home, a lot of scenarios.

01:07:59.984 --> 01:08:03.995
And it poses a lot of challenges
on the hardware design

01:08:03.995 --> 01:08:07.851
that require low
latency, privacy, mobility

01:08:07.851 --> 01:08:09.355
and energy efficiency.

01:08:09.355 --> 01:08:12.202
You don't want your battery
to drain very quickly.

01:08:12.202 --> 01:08:15.155
So it's both a challenging
and a very exciting era

01:08:15.155 --> 01:08:18.904
for the co-design of
both the machine learning,

01:08:18.904 --> 01:08:20.595
deep neural network model architectures

01:08:20.595 --> 01:08:23.283
and also the hardware architecture.

01:08:23.283 --> 01:08:26.773
So we have moved from
PC era to mobile era.

01:08:26.773 --> 01:08:29.973
Now we are in the AI-First era,

01:08:29.973 --> 01:08:32.818
and hope you are as excited
as I am for this kind of

01:08:32.818 --> 01:08:36.485
brain-inspired cognitive
computing research.

01:08:37.773 --> 01:08:41.962
Thank you for your attention,
I'm glad to take questions.

01:08:41.962 --> 01:08:44.212
[applause]

01:08:50.875 --> 01:08:52.625
We have five minutes.

01:08:54.323 --> 01:08:55.643
Of course.

01:08:55.643 --> 01:08:59.504
- [Student] Can you commercialize
the deep architecture?

01:08:59.504 --> 01:09:04.122
- The architecture, yeah, some
of the ideas are pretty good.

01:09:04.122 --> 01:09:06.583
I think there's opportunity.

01:09:06.584 --> 01:09:07.417
Yeah.

01:09:11.841 --> 01:09:12.674
Yeah.

01:09:30.091 --> 01:09:34.258
The question is, what can we
do to make the hardware better?

01:09:46.997 --> 01:09:48.979
Oh, right, the question is about

01:09:48.979 --> 01:09:51.917
the challenges and what
opportunities there are for those small

01:09:51.917 --> 01:09:54.699
embedded devices running
deep neural networks

01:09:54.699 --> 01:09:57.006
or in general AI algorithms.

01:09:57.006 --> 01:10:00.673
Yeah, so those are the
algorithms I discussed

01:10:02.197 --> 01:10:04.947
in the beginning about inference.

01:10:06.309 --> 01:10:07.142
Here.

01:10:08.579 --> 01:10:12.448
These are the techniques
that can enable such

01:10:12.448 --> 01:10:15.107
inference or AI running
on embedded devices,

01:10:15.107 --> 01:10:18.448
by having a smaller number of
weights, fewer bits per weight,

01:10:18.448 --> 01:10:20.648
and also quantization,
low rank approximation.

01:10:20.648 --> 01:10:24.397
A smaller matrix, same
accuracy; even going to binary

01:10:24.397 --> 01:10:27.808
or ternary weights having just two bits

01:10:27.808 --> 01:10:31.288
to do the computation rather
than 16 or even 32 bit

01:10:31.288 --> 01:10:33.745
and also the Winograd Transformation.

01:10:33.745 --> 01:10:36.456
Those are also the enabling
algorithms for those

01:10:36.456 --> 01:10:38.706
low-power embedded devices.

01:10:57.356 --> 01:11:02.189
Okay, the question is, if it's
binary weight, the software

01:11:02.189 --> 01:11:06.356
developers may not be able
to take advantage of it.

01:11:07.509 --> 01:11:11.418
There is a way to take
advantage of binary weight.

01:11:11.418 --> 01:11:14.418
So in one register there are 32 bits.

01:11:16.538 --> 01:11:19.827
Now you can think of it
as a 32-way parallelism.

01:11:19.827 --> 01:11:22.457
Each bit is a single operation.

01:11:22.457 --> 01:11:25.120
So say previously we
had 10 ops per second.

01:11:25.120 --> 01:11:27.703
Now you get 320 ops per second.

01:11:31.000 --> 01:11:33.917
You can do this bitwise operations.

01:11:34.960 --> 01:11:37.287
For example, XOR operations.

01:11:37.287 --> 01:11:39.368
So with one register,

01:11:39.368 --> 01:11:42.285
one operation becomes 32 operations.
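
A hedged sketch of that 32-way bit parallelism: pack 32 binary (+1/-1) weights or activations into one 32-bit word, and an XNOR plus a popcount stands in for 32 multiply-accumulates.

```python
import numpy as np

def pack_bits(signs):
    """signs: array of 32 values in {+1, -1} -> one packed uint32 word."""
    bits = (np.asarray(signs) > 0).astype(np.uint32)
    return np.uint32(sum(int(b) << i for i, b in enumerate(bits)))

def binary_dot(w_word, x_word, n=32):
    """Dot product of two +/-1 vectors, computed from their packed words."""
    xnor = np.uint32(~(w_word ^ x_word) & 0xFFFFFFFF)   # 1 wherever signs agree
    matches = bin(int(xnor)).count("1")                  # popcount
    return 2 * matches - n                               # agreements minus disagreements

rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=32)
x = rng.choice([-1, 1], size=32)
assert binary_dot(pack_bits(w), pack_bits(x)) == int(np.dot(w, x))
```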

01:11:43.608 --> 01:11:47.058
So there is a paper called XNOR-Net;

01:11:47.058 --> 01:11:49.845
they amazingly implemented it

01:11:49.845 --> 01:11:52.637
on the Raspberry Pi using this feature

01:11:52.637 --> 01:11:55.907
to do real-time detection,
very cool stuff.

01:11:55.907 --> 01:11:56.740
Yeah.

01:12:11.779 --> 01:12:15.946
Yeah, so the trade-off is
always power, area,

01:12:16.956 --> 01:12:19.819
and performance. In general,
all hardware designs

01:12:19.819 --> 01:12:23.298
have to take into account
the performance, the power,

01:12:23.298 --> 01:12:24.798
and also the area.

01:12:26.158 --> 01:12:29.387
When machine learning
comes, there's a fourth

01:12:29.387 --> 01:12:32.107
figure of merit which is the accuracy.

01:12:32.107 --> 01:12:34.089
What is the accuracy?

01:12:34.089 --> 01:12:37.019
And there is a fifth one
which is programmability.

01:12:37.019 --> 01:12:39.089
So how general is your hardware?

01:12:39.089 --> 01:12:42.089
For example, if Google just
wants to use that for AI

01:12:42.089 --> 01:12:45.507
and deep learning, it's totally fine

01:12:45.507 --> 01:12:48.635
that we can have a very
specialized architecture

01:12:48.635 --> 01:12:51.206
just for deep learning
to support convolution,

01:12:51.206 --> 01:12:54.307
multi-layer perceptrons,
long short-term memory,

01:12:54.307 --> 01:12:58.224
but for GPUs, you also want
to have support for

01:13:00.067 --> 01:13:03.734
scientific computing
or graphics, AR and VR.

01:13:04.915 --> 01:13:07.998
So that's a difference, first of all.

01:13:10.804 --> 01:13:14.244
And the TPU basically is an ASIC, right?

01:13:14.244 --> 01:13:16.987
It's a very fixed-function design,
but you can still program it

01:13:16.987 --> 01:13:21.587
with those coarse instructions
so people from Google

01:13:21.587 --> 01:13:24.755
roughly designed those coarse
granularity instructions.

01:13:24.755 --> 01:13:27.467
For example, one instruction
just load the matrix,

01:13:27.467 --> 01:13:29.795
store a matrix, do convolutions,

01:13:29.795 --> 01:13:31.507
do matrix multiplications.

01:13:31.507 --> 01:13:34.377
Those coarse-grain instructions

01:13:34.377 --> 01:13:37.710
and they have a software-managed memory,

01:13:38.605 --> 01:13:40.558
also called a scratchpad.

01:13:40.558 --> 01:13:43.885
It's different from a
cache, where the hardware determines

01:13:43.885 --> 01:13:47.217
where to evict something
from the cache, but now,

01:13:47.217 --> 01:13:49.845
since you know the computation pattern,

01:13:49.845 --> 01:13:53.512
there's no need to do out-of-order execution,

01:13:54.446 --> 01:13:57.066
or branch prediction, no such things.

01:13:57.066 --> 01:14:00.255
Everything is deterministic,
so you can take advantage of

01:14:00.255 --> 01:14:04.422
it and maintain a fully
software-managed scratchpad

01:14:05.337 --> 01:14:09.897
to reduce the data movement.
And remember, data movement

01:14:09.897 --> 01:14:13.084
is the key for reducing
the memory footprint

01:14:13.084 --> 01:14:14.606
and energy consumption.

01:14:14.606 --> 01:14:15.439
So, yeah.

01:14:26.633 --> 01:14:30.313
The Movidius and Nervana architectures,
actually I'm not quite

01:14:30.313 --> 01:14:33.813
familiar with; I didn't prepare those slides, so

01:14:34.736 --> 01:14:37.569
I'll comment on those a little bit later.

01:14:52.428 --> 01:14:54.507
Oh, yeah, of course.

01:14:54.507 --> 01:14:57.778
Those can always and
certainly be applied

01:14:57.778 --> 01:15:00.269
to low-power embedded devices.

01:15:00.269 --> 01:15:03.686
If you're interested, I can show you a...

01:15:04.629 --> 01:15:05.462
Whoops.

01:15:06.971 --> 01:15:08.888
Some examples of, oops.

01:15:10.689 --> 01:15:11.859
Where is that?

01:15:11.859 --> 01:15:15.731
Of my previous projects
running deep neural nets.

01:15:15.731 --> 01:15:19.394
For example, on a drone,
this is using a Nvidia TK1

01:15:19.394 --> 01:15:23.561
mobile GPU to do real-time
tracking and detection.

01:15:26.691 --> 01:15:28.898
This is me playing my nunchaku.

01:15:28.898 --> 01:15:32.898
Filmed by a drone to do the
detection and tracking.

01:15:34.672 --> 01:15:38.939
And also, this FPGA doing
the deep neural network.

01:15:38.939 --> 01:15:41.039
It's pretty small.

01:15:41.039 --> 01:15:44.611
Only about this large, doing face alignment and

01:15:44.611 --> 01:15:48.194
detecting the eyes,
the nose and the mouth,

01:15:49.352 --> 01:15:51.602
at a pretty high framerate.

01:15:53.151 --> 01:15:55.401
Consuming only three watts.

01:15:56.918 --> 01:16:00.689
This is a project I did
at Facebook doing the

01:16:00.689 --> 01:16:03.269
deep neural nets on the mobile phone to do

01:16:03.269 --> 01:16:06.781
image classification, for
example, it says it's a laptop,

01:16:06.781 --> 01:16:10.389
or you can feed it with
an image and it says

01:16:10.389 --> 01:16:14.480
it's a selfie, has person
and the face, et cetera.

01:16:14.480 --> 01:16:17.621
So there's lots of opportunity for those

01:16:17.621 --> 01:16:21.788
embedded or mobile-deployment
of deep neural nets.

01:16:30.419 --> 01:16:32.288
No, there is a team doing that,

01:16:32.288 --> 01:16:34.808
but I cannot comment too much, probably.

01:16:34.808 --> 01:16:38.975
There is a team at Google
doing that sort of stuff, yeah.

01:16:44.876 --> 01:16:46.208
Okay, thanks, everyone.

01:16:46.208 --> 00:00:00.000
If you have any questions,
feel free to drop me a e-mail.